Amon Amon - 2 months ago 17
HTML Question

Receiving an empty list when trying to make a web scraper to parse websites for links

I was reading this website and learning how to make a web scraper with lxml and Requests. This is the web scraper code:

from lxml import html
import requests

web_page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')

tree = html.fromstring(web_page.content)


buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')

print("These are the buyers:", buyers)
print("And these are the prices:", prices)


It works as intended, but when I try to scrape https://www.reddit.com/r/cringe/ for all the links, I'm simply getting [] as a result:

#this code will scrape a Reddit page

from lxml import html
import requests


web_page = requests.get("https://www.reddit.com/r/cringe/")

tree = html.fromstring(web_page.content)

links = tree.xpath('//div[@class="data-url"]/text()')

print(links)


What's the problem with the XPath I'm using? I can't figure out what to put in the square brackets in the XPath.

Answer

First off, your XPath is wrong: no element has a class named data-url. data-url is an attribute, so you would want div[@data-url] to select the divs, and /@data-url to extract the attribute value:

from lxml import html
import requests

headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.92 Safari/537.36"}
web_page = requests.get("https://www.reddit.com/r/cringe/", headers=headers)

tree = html.fromstring(web_page.content)

links = tree.xpath('//div[@data-url]/@data-url')

print(links)

Also, if you query too often or don't send a User-Agent header, you may get HTML like the following back, so respect the limits they recommend:

<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>

<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>

<p>please wait 6 second(s) and try again.</p>

    <p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
  </body>
</html>
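To stay under the limit quoted in that message (one request every two seconds), you could space out your requests. This is only a minimal sketch: the Throttle class and its parameters are names I made up for illustration, not part of Requests or lxml.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum delay between successive calls."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

You would create one Throttle(2.0) and call its wait() method right before each requests.get(...), keeping the same headers as in the code above.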

If you plan on scraping a lot of reddit, you may want to look at PRAW. w3schools also has a nice introduction to XPath expressions.

To break it down:

//div[@data-url]

searches the document for divs that have a data-url attribute. We don't care what the attribute's value is; we just want the div.

That just finds the divs. If you removed the /@data-url, you would end up with a list of elements like:

[<Element div at 0x7fbb27a9a940>, <Element div at 0x7fbb27a9a8e8>,..

/@data-url actually extracts the attribute value, i.e. the link URLs.
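You can try the three variants without hitting reddit at all. The markup below is a made-up sample (not reddit's real HTML), assuming only that the URL sits in a data-url attribute on a div:

```python
from lxml import html

# Hypothetical markup: the URL lives in a data-url attribute,
# not in a class name and not in the element text.
sample = """
<html><body>
  <div class="thing" data-url="https://www.youtube.com/watch?v=abc">First post</div>
  <div class="thing" data-url="https://example.com/article">Second post</div>
  <div class="thing">No link here</div>
</body></html>
"""

tree = html.fromstring(sample)

# The query from the question: looks for class="data-url", which no div has.
by_class = tree.xpath('//div[@class="data-url"]/text()')

# Selecting on the attribute's presence returns the div elements themselves.
divs = tree.xpath('//div[@data-url]')

# Appending /@data-url returns the attribute values as strings.
urls = tree.xpath('//div[@data-url]/@data-url')

print(by_class)
print(divs)
print(urls)
```

The first query comes back empty, the second gives two Element objects, and the third gives the two URL strings. The third div is skipped by both attribute queries because it has no data-url at all.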

Also, if you just wanted specific links, e.g. the YouTube links, you could filter using contains:

'//div[contains(@data-url, "www.youtube.com")]/@data-url'

contains(@data-url, "www.youtube.com") checks whether each data-url attribute value contains www.youtube.com, so the output will be a list of the YouTube links.
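Here is the contains filter run against a small made-up sample (again, not reddit's real markup). Note that it is a plain substring test, so a short youtu.be link would not match; the second query, which combines two contains calls with or, is my own addition in case you want those too.

```python
from lxml import html

# Hypothetical markup with three different kinds of link.
sample = """
<html><body>
  <div data-url="https://www.youtube.com/watch?v=abc">Full YouTube link</div>
  <div data-url="https://example.com/article">Unrelated link</div>
  <div data-url="https://youtu.be/xyz">Short YouTube link</div>
</body></html>
"""

tree = html.fromstring(sample)

# Keep only the divs whose data-url contains the substring www.youtube.com.
youtube_links = tree.xpath(
    '//div[contains(@data-url, "www.youtube.com")]/@data-url'
)

# Assumption: you may also want short links, so match either substring.
with_short_links = tree.xpath(
    '//div[contains(@data-url, "www.youtube.com")'
    ' or contains(@data-url, "youtu.be")]/@data-url'
)

print(youtube_links)
print(with_short_links)
```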