Catalin Besleaga - 7 months ago
HTML Question

XPATH - html with a lot of children

Consider the HTML in the page variable.

How do I access the tds?

I want to access them like

xpath("/table/tr/td/text()")


I don't want to spell out the intermediate trs.

Unfortunately this expression
xpath('.//table/tr/tr/tr/td/text()')
doesn't work either.

Python code:

import __future__
from lxml import html
import requests
from bs4 import BeautifulSoup

page = """
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>cv</title>
</head>
<body>

<table>
<tr>
<tr>
<tr>
<td>table1 td1</td>
<td>table1 td2</td>
</tr>
</tr>
</tr>
</table>

<table>
<tr>
<tr>
<tr>
<td>table2 td1</td>
<td>table2 td2</td>
</tr>
</tr>
</tr>
</table>

<table>
<tr>
<tr>
<tr>
<td>table3 td1</td>
<td>table3 td2</td>
</tr>
</tr>
</tr>
</table>
</body>
</html>
"""

soup = str(BeautifulSoup(page, 'html.parser'))
tree = html.fromstring(soup)

things = tree.xpath('.//table/tr/tr/tr/td/text()')

print(things)

for thing in things:
    print(thing)

print("That's all")


I want it from the root!

Answer

Use xpath //td/text():

things = tree.xpath('//td/text()')

The //td stands for "find any td element at any depth".

Works for me.
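
As a quick check, here is a minimal, self-contained sketch of the same query. The sample_page string is a shortened, hypothetical stand-in for the page variable in the question (two tables instead of three), and all_cells is just an illustrative name.

```python
from lxml import html

# Shortened, hypothetical stand-in for the question's `page` string.
# The <tr> elements are nested the same way as in the question.
sample_page = """
<html><body>
<table>
<tr><tr><tr>
<td>table1 td1</td>
<td>table1 td2</td>
</tr></tr></tr>
</table>
<table>
<tr><tr><tr>
<td>table2 td1</td>
<td>table2 td2</td>
</tr></tr></tr>
</table>
</body></html>
"""

tree = html.fromstring(sample_page)

# //td matches every td in the document, however deeply it is nested,
# so none of the intermediate tr steps has to be written out.
all_cells = tree.xpath('//td/text()')

print(all_cells)
# ['table1 td1', 'table1 td2', 'table2 td1', 'table2 td2']
```

The cells come back in document order, so the values from all tables end up in one flat list.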

Printing td elements grouped per table:

doc = html.fromstring(page)
for table_elm in doc.xpath("//table"):
    print("another table")
    things = table_elm.xpath('.//td/text()')
    print(things)

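Note that the leading . in the XPath is significant here: it makes the search relative to the current table element. Without it, //td would match every td in the whole document again.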
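
To make the role of the leading . concrete, here is a small sketch that runs both forms from the same table element. The markup and the names sample_page, first_table, relative_cells and absolute_cells are made up for illustration and are not part of the question's code.

```python
from lxml import html

# Hypothetical two-table document, kept short for illustration.
sample_page = """
<html><body>
<table><tr><td>a1</td><td>a2</td></tr></table>
<table><tr><td>b1</td><td>b2</td></tr></table>
</body></html>
"""

doc = html.fromstring(sample_page)
first_table = doc.xpath('//table')[0]

# With the leading dot the search starts at first_table.
relative_cells = first_table.xpath('.//td/text()')

# Without the dot the path is absolute and starts at the document root,
# even though it is called on first_table.
absolute_cells = first_table.xpath('//td/text()')

print(relative_cells)  # ['a1', 'a2']
print(absolute_cells)  # ['a1', 'a2', 'b1', 'b2']
```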
