boldbrandywine - 10 months ago 70
HTML Question

Scrapy returning empty list for xpath

I am using Scrapy to get abstracts from OpenReview URLs. For example, I want to get the abstract from, and upon doing

$ scrapy shell ""
$ response.xpath('//span[@class="note_content_value"]').extract()

I get back an empty list ([]). In addition, when I do
I am led to a blank site.

Further, inspecting the OpenReview webpage shows me there are script elements, which I've never seen before. When I call

I get things back like
u'<script src="static/libs/search.js"></script>'
for example.

I've read a little bit about this having something to do with javascript, but I'm kind of a beginner with Scrapy and unsure how to bypass this and get what I want.

Answer Source

I found that the page uses JavaScript/AJAX to load all of its information from the address

But it needs two cookies to get access to this information. First the server sends the cookie GCLB; later the page loads and gets the second cookie openreview:sid. After that, the page can load the JSON data.

Here is a working example with requests:

import requests

s = requests.Session()

# to get `GCLB` cookie
r = s.get('')

# to get `openreview:sid` cookie
r = s.get('')

# to get JSON data
r = s.get('')
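Once the JSON is loaded, the abstract still has to be pulled out of the response body. The exact shape of OpenReview's JSON is not shown above, so this is only a sketch under an assumed layout: a "notes" list whose items keep their fields under a "content" key. Inspect the real response in your browser's network tab before relying on these key names.

```python
import json

# Hypothetical payload shaped like a notes endpoint response.
# The "notes"/"content"/"abstract" keys are an assumption, not
# taken from the answer above -- check the real JSON first.
body = '{"notes": [{"content": {"title": "A Paper", "abstract": "We study X."}}]}'

data = json.loads(body)  # with requests you would call r.json() instead
abstracts = [note["content"].get("abstract") for note in data["notes"]]
print(abstracts)
```

Using `.get("abstract")` instead of indexing avoids a KeyError for notes (e.g. replies) that have no abstract field.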

Another solution: use Selenium or some other tool to run the JavaScript code, and then you can get the full HTML with all the information. Scrapy can probably be combined with Selenium or PhantomJS to run JavaScript, but I have never tried it with Scrapy.