foosion - 28 days ago
Python Question

lxml separates elements while beautifulsoup does not

lxml returns two items, while beautifulsoup returns only one. Is that because the <br/> shouldn't be there and beautifulsoup is more tolerant of bad HTML?

Is there a better way to extract the location using lxml? The <br/> isn't always there.

from lxml import html
from bs4 import BeautifulSoup as bs

s = '''<td class="location">
<p>
TRACY,<br/>&nbsp;CA&nbsp;95304&nbsp;
</p></td>
'''

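tree = html.fromstring(s)
# text() selects each text node directly under <p>; the <br/> splits them in two
r = tree.xpath('//td[@class="location"]/p/text()')
print(r)

soup = bs(s, 'lxml')
# get_text() concatenates every text node under the <td> into one string
r = soup.find_all('td', class_='location')[0].get_text()
print(r)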

Answer

Is there a better way to extract the location using lxml? The <br/> isn't always there.

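The <br/> is valid HTML; the difference comes from what each call asks for. The text() step returns every text node that is a direct child of <p> as a separate item, and the <br/> splits the text into two nodes, whereas get_text() concatenates all text under the element into one string. If by "better" you mean returning a result closer to its BS counterpart, then the XPath expression that most closely resembles your BS code uses string(), which also concatenates the descendant text nodes: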

>>> print(tree.xpath('string(//td[@class="location"])'))


    TRACY, CA 95304 
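
The string() result still carries the newlines and non-breaking spaces from the markup. A minimal sketch of one way to tidy it, assuming Python 3 and that collapsing all whitespace is acceptable; the extract_location helper and the SAMPLE string (the question's cell wrapped in a table) are hypothetical names introduced here for illustration:

```python
from lxml import html

# Hypothetical sample: the question's cell, wrapped in <table><tr> so the
# <td> sits in a well-formed table.
SAMPLE = '''<table><tr><td class="location">
<p>
TRACY,<br/>&nbsp;CA&nbsp;95304&nbsp;
</p></td></tr></table>'''


def extract_location(markup):
    """Return the location cell's text with whitespace collapsed."""
    tree = html.fromstring(markup)
    # string() concatenates every descendant text node of the first match,
    # so the result is one string whether or not a <br/> is present.
    raw = tree.xpath('string(//td[@class="location"])')
    # split() with no arguments splits on any Unicode whitespace, which in
    # Python 3 includes the non-breaking space produced by &nbsp;.
    return ' '.join(raw.split())


print(extract_location(SAMPLE))
```

Because the text is joined before it is cleaned, the same call should work for cells where the <br/> is missing.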