AutomaticStatic - 1 month ago
Python Question

Retrieving tail text from html

Python 2.7 using lxml

I have some annoyingly formed html that looks like this:

"123 Main st.
"New York
"101 California St.
"San Francisco

So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.

So far what I've done is gotten a list of nodes with names. So let's assume I'm currently on the <b> node for John.

I'm trying to get the tail text for everything following the current node but preceding the next <b> node (Sally). I've tried a bunch of different XPath queries but can't seem to get this right. In particular, any time I use a boolean operator in an expression that has no brackets, it returns a bool rather than a list of all nodes meeting the conditions. Can anyone help out?


This should work:

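from lxml import etree

# Parse the file with lxml's lenient HTML parser.
parser = etree.HTMLParser()
with open(r'./test.html', 'r') as html:
    data = html.read()
tree = etree.fromstring(data, parser)

my_dict = {}

# Each <b> holds a name; the element right after it (the first <br>)
# carries the first address line as its tail text.
for b in tree.iter('b'):
    br = b.getnext()
    if br is None or br.tail is None:
        continue  # no following element, or no text after it
    my_dict[b.text.replace('\n', '')] = br.tail.replace('\n', '')

print(my_dict)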

This code prints:

{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}

(You may want to strip the quotation marks out!)
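One way to strip the quotation marks, as a minimal sketch: the helper name clean is hypothetical (it is not part of lxml), and the input dict is simply the output shown above.

```python
def clean(text):
    """Trim surrounding whitespace/newlines, then surrounding double quotes."""
    if text is None:  # lxml reports missing text or tail as None
        return ''
    return text.strip().strip('"')

# The dict printed by the code above, used here as sample input.
raw = {'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}

cleaned = dict((clean(k), clean(v)) for k, v in raw.items())
print(cleaned)
```

Calling strip() before strip('"') also takes care of the newlines, so the separate replace('\n', '') calls would no longer be needed.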

Rather than using XPath, you could use one of lxml's parsers to navigate the HTML directly. The parser turns the HTML document into an "etree", which you can navigate with the provided methods. Each element has a method called iter(), which takes a tag name and yields every element in the tree with that name. In your case, if you use this to obtain all of the <b> elements, you can then step to the <br> element that follows each one with getnext() and read its tail text, which contains the information you need. Note that this picks up only the first line after each name (the street address, not the city). You can find more about tail text under the "Elements contain text" header of the lxml.etree tutorial.
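If you need every line between one name and the next, the same idea can be extended by walking the siblings that follow each <b> until another <b> turns up. The sketch below is only an illustration: the SAMPLE markup is an assumption reconstructed from the output above (names in <b>, address lines after <br>), and addresses_by_name and clean_text are hypothetical helper names.

```python
from lxml import etree

# Assumed markup: a guess at the structure, not the asker's actual file.
SAMPLE = '''<table><tr><td>
<b>"John"</b><br>"123 Main st."<br>"New York"
<b>"Sally"</b><br>"101 California St."<br>"San Francisco"
</td></tr></table>'''


def clean_text(text):
    """Trim whitespace and surrounding double quotes; None becomes ''."""
    return (text or '').strip().strip('"')


def addresses_by_name(html_text):
    """Map each <b> name to the list of text lines that follow it."""
    tree = etree.fromstring(html_text, etree.HTMLParser())
    result = {}
    for b in tree.iter('b'):
        # Text directly after </b> is b.tail; text after each later
        # sibling (such as <br>) is that sibling's tail.
        tails = [b.tail]
        for sibling in b.itersiblings():
            if sibling.tag == 'b':
                break  # the next name starts a new entry
            tails.append(sibling.tail)
        lines = [clean_text(t) for t in tails]
        result[clean_text(b.text)] = [line for line in lines if line]
    return result


print(addresses_by_name(SAMPLE))
```

With the assumed sample, John maps to ['123 Main st.', 'New York'] and Sally to ['101 California St.', 'San Francisco']. If the real markup nests the address text inside other elements, the walk would need to read their text as well as their tails.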