boldbrandywine - 29 days ago

Scrapy returning empty list for xpath

I am using Scrapy to get abstracts from openreview urls. For example, I want to get the abstract from http://openreview.net/forum?id=Bk0FWVcgx, and upon doing

$ scrapy shell "http://openreview.net/forum?id=Bk0FWVcgx"
>>> response.xpath('//span[@class="note_content_value"]').extract()


I get back [].

In addition, when I do view(response) I am led to a blank page at
file:///var/folders/1j/_gkykr316td7f26fv1775c3w0000gn/T/tmpBehKh8.html.

Further, inspecting the openreview webpage shows me there are script elements, which I've never seen before. When I call

response.xpath('//script').extract()

I get things back like

u'<script src="static/libs/search.js"></script>'

for example.

I've read a little bit about this having something to do with javascript, but I'm kind of a beginner with Scrapy and unsure how to bypass this and get what I want.

Answer

I found that the page uses JavaScript/AJAX to load all of its information from the address
http://openreview.net/notes?forum=Bk0FWVcgx&trash=true

But it needs two cookies to access this information. First the server sends the GCLB cookie, then the page loads http://openreview.net/token to get the second cookie, openreview:sid. After that the page can load the JSON data.

Here is a working example with requests:

import requests

s = requests.Session()

# to get `GCLB` cookie
r = s.get('http://openreview.net/forum?id=Bk0FWVcgx')
print(r.cookies)

# to get `openreview:sid` cookie
r = s.get('http://openreview.net/token')
print(r.cookies)

# to get JSON data
r = s.get('http://openreview.net/notes?forum=Bk0FWVcgx&trash=true')
print(r.json())
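
Since the question is about Scrapy, here is a sketch of how the same cookie chain might look inside a Scrapy spider: the forum page sets GCLB, the token URL sets openreview:sid, and Scrapy's cookie middleware carries the cookies between requests. The spider name and the JSON layout (the "notes" / "content" / "abstract" keys) are assumptions, so inspect the actual response before relying on them.

import json
import scrapy


class OpenReviewSpider(scrapy.Spider):
    # hypothetical spider name; the URLs mirror the requests example above
    name = "openreview_abstract"
    start_urls = ["http://openreview.net/forum?id=Bk0FWVcgx"]

    def parse(self, response):
        # the first response sets the GCLB cookie; now fetch the token
        # to receive the openreview:sid cookie (cookies are kept for us
        # by Scrapy's cookie middleware)
        yield scrapy.Request("http://openreview.net/token", callback=self.parse_token)

    def parse_token(self, response):
        # with both cookies set, the JSON endpoint should be reachable
        yield scrapy.Request(
            "http://openreview.net/notes?forum=Bk0FWVcgx&trash=true",
            callback=self.parse_notes,
        )

    def parse_notes(self, response):
        data = json.loads(response.text)
        # the exact JSON structure is a guess; print data to see what is there
        for note in data.get("notes", []):
            abstract = note.get("content", {}).get("abstract")
            if abstract:
                yield {"abstract": abstract}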

Another solution: use Selenium or another tool to run the JavaScript; then you can get the full HTML with all of the information. Scrapy can probably be combined with Selenium or PhantomJS to run JavaScript, but I have never tried that with Scrapy.
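
For the Selenium route, a minimal sketch could look like the following. It assumes a local Chrome/chromedriver setup, and the 10-second wait and the idea that the abstract ends up in the note_content_value spans come from the question, not from verifying the rendered page.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
try:
    driver.get("http://openreview.net/forum?id=Bk0FWVcgx")
    # wait until the JavaScript has rendered the note content spans
    spans = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "note_content_value"))
    )
    for span in spans:
        print(span.text)
finally:
    driver.quit()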
