Chris Chris - 1 month ago 9
Python Question

Building a nested list with XPath-extracted XML document structure

I am trying to get the text (using XPath) of all <h2> tags in:

<div id="static_id">
<div>...
<a ...>
<div>...
<h2>Text 1</h2>
<a ...>
<div>...
<div>...
<span>...
<h2>Text 2</h2>
<a ...>
<span>...
<h2>Text 3</h2>

<div id="static_id">
<div>...
<span>...
<h2>Text A</h2>
<a ...>
<div>...
<p>...
<div>...
<h2>Text B</h2>
<a ...>
<h2>Text C</h2>
[...]


In my HTML source code there are <div>s with the id static_id. Within these divs there is just one <h2> tag and I want to get its content. In the end I would like to have a list that looks like this:

lst = [["Text 1", "Text 2", "Text 3"], ["Text A", "Text B", "Text C"]]


Please note that it's a list of lists: every h2 text from one <div id="static_id"> should end up in a separate list, as in the example above.

Is there an easy way to achieve this?

I thought I would count all static_id divs and iterate over all <h2> tags to achieve this. My approach:

list_all = []
div_amount = len(tree.xpath('//div[@id="static_id"]'))  # 2 elements in this case (works)
for d in range(1, div_amount + 1):  # 1, 2
    h2_count = len(tree.xpath('//div[@id="static_id"][' + str(d) + ']//h2'))  # count h2
    lst = []
    for i in range(1, h2_count + 1):  # 1, 2, 3
        h2_text = ''.join(tree.xpath('//div[@id="static_id"][' + str(d) + ']//h2[' + str(i) + ']/text()'))
        lst.append(h2_text)
    list_all.append(lst)


Line 2: Count all divs with id="static_id"

Line 3: Loop over all divs with id="static_id"

Line 4: Count all h2 tags (unfortunately, all h2s from the HTML source are counted)

Line 6: Loop over all h2s

Lines 7-8: Get the h2 text and save it in the list

Can anyone help me out please? I feel like this could be done more easily, but I don't know how.

Answer

Easily made a one-liner:

list_all = [ static_id_div.xpath('.//h2/text()')
             for static_id_div in tree.xpath('//div[@id="static_id"]') ]

The important difference here is that the inner query is run against each element returned by the outer query, rather than starting again from the root of the document. The leading . in .//h2 is what makes the inner path relative to the current div.
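To try the relative-query pattern without any third-party package, here is a minimal sketch using the standard library's xml.etree.ElementTree instead of lxml. This is a swapped-in library, not the one used above: it supports only a limited XPath subset with no text() step, so the text is read from each element's .text attribute. The SAMPLE markup is a simplified, well-formed stand-in for the question's HTML; its nesting is an assumption, since the original elides most of the structure.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the question's markup.
# The nesting here is assumed; the original elides it with "...".
SAMPLE = """
<body>
  <div id="static_id">
    <div><h2>Text 1</h2></div>
    <div><div><h2>Text 2</h2></div></div>
    <span><h2>Text 3</h2></span>
  </div>
  <div id="static_id">
    <div><span><h2>Text A</h2></span></div>
    <div><h2>Text B</h2></div>
    <h2>Text C</h2>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Outer query: find every div with the given id anywhere below the root.
# Inner query: search for h2 elements relative to each matched div only.
list_all = [
    [h2.text for h2 in div.findall(".//h2")]
    for div in root.findall(".//div[@id='static_id']")
]

print(list_all)
# [['Text 1', 'Text 2', 'Text 3'], ['Text A', 'Text B', 'Text C']]
```

Each inner list holds only the h2 texts found below its own div, because findall is called on the div element rather than on the root.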