Siraj S. - 1 year ago
Python Question

Parse HTML tables with lxml

I have been trying to parse the table contents from here.
I have tried a couple of alternatives, like


Here is my latest code:

import requests
import lxml.html

url = ''
response = requests.get(url)
html = lxml.html.fromstring(response.content)
# Get the text inside all "<tr><td><a ...>text</a></td></tr>"
packages = html.xpath('//div[@id="replacetext"]/table/tbody//tr/td/a//text()')

However, none of the alternatives seems to work. In the past, I have scraped data with similar code (although not from this URL). Any guidance would be really helpful.

Answer Source

I tried your code. The problem is not caused by lxml; it is caused by how you load the web page.

I know you use requests to get the content of the web page; however, the content that requests returns may differ from what you see in the browser.

For this page, '', if you print the content returned by requests.get, you will find that the page source contains no table. The table is loaded by an AJAX request.
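A quick way to confirm this is to count the table rows in the raw response before writing any extraction logic. This is a minimal sketch: the helper name `count_table_rows` is made up for illustration, and the `replacetext` id is taken from the XPath in the question.

```python
import lxml.html


def count_table_rows(page_source, container_id="replacetext"):
    """Count the <tr> elements under the div with the given id in raw HTML."""
    doc = lxml.html.fromstring(page_source)
    rows = doc.xpath('//div[@id="%s"]//tr' % container_id)
    return len(rows)


# Hypothetical usage with the response from the question:
# response = requests.get(url)
# print(count_table_rows(response.content))
```

If this returns 0 for the downloaded page while the browser shows a populated table, the rows are being added after the initial page load.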

So find a way to load the page content you actually want; then you can use lxml.
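One possible approach, assuming the AJAX request returns an HTML fragment containing the table, is to request that endpoint directly and parse its response. The endpoint is not known here, so `ajax_url` below is a placeholder you would have to find yourself (for example in the browser's network tab), and `extract_link_texts` is a hypothetical helper name.

```python
import lxml.html


def extract_link_texts(fragment):
    """Return the text of every <a> found directly inside a table cell."""
    doc = lxml.html.fromstring(fragment)
    # No /tbody step here: browsers add <tbody> to the DOM they display,
    # but lxml only sees it if it is present in the markup itself.
    texts = doc.xpath('.//tr/td/a//text()')
    return [t.strip() for t in texts if t.strip()]


# Hypothetical usage, assuming ajax_url is the endpoint the page calls:
# response = requests.get(ajax_url)
# packages = extract_link_texts(response.content)
```

If the endpoint returns JSON instead of HTML, `response.json()` would be the starting point rather than lxml.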

By the way, there are other things to pay attention to in web scraping, such as request headers. It is good practice to set request headers when making an HTTP request: some sites may block you if you do not send a reasonable User-Agent header. That has nothing to do with your current problem, though.
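Setting headers with requests can be sketched as follows. The User-Agent value is only an example string, not something the site is known to require, and `make_session` is a hypothetical helper name.

```python
import requests

# Assumed example value; replace it with a User-Agent suitable for your client.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def make_session(headers=HEADERS):
    """Create a session that sends the given headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


# Hypothetical usage with the url from the question:
# session = make_session()
# response = session.get(url)
```

Using a session keeps the headers in one place, so every request made through it sends the same User-Agent.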

