Anand Anand - 1 month ago 20
Python Question

Fetch News Data from right Scrollbar using Beautifulsoup

I am using the following webpage: https://www.google.com/finance?q=NYSE%3AF&ei=LvflU_itN8zbkgW0i4GABQ
I want to extract the news items shown in the scrolling panel on the right-hand side.

I have attached a screenshot in which a red arrow marks the segment.

I have used the following code:

import urllib2
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4

def parse():
    mainPage = urllib2.urlopen("https://www.google.com/finance?q=NYSE%3AF&ei=LvflU_itN8zbkgW0i4GABQ")
    lSoupPage = BeautifulSoup(mainPage)

    # Look for news items inside the scrollbar container
    for index in lSoupPage.findAll("div", {"class" : "jfk-scrollbar"}):
        for item in index.findAll("div", {"class" : "news-item"}):
            print item.a.text.strip()


I am not able to fetch the news-url by doing this. Please help.

Answer

The sidebar is loaded over AJAX and is not part of the page's initial HTML, so it never appears in the document that BeautifulSoup parses.

The page has a content ID (cid), which you can read from its canonical link:

cid = lSoupPage.find('link', rel='canonical')['href'].rpartition('=')[-1]
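To see what the rpartition call does, here is a small sketch run against a made-up canonical URL. The URL and the 13606 value are assumptions for illustration only; the real page's canonical link may look different. The sketch is written for Python 3, and the parse_qs variant is an alternative that the answer itself does not use.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical canonical href; the real page's value may look different.
canonical_href = "https://www.google.com/finance?cid=13606"

# rpartition splits on the LAST '=' and returns (head, sep, tail);
# [-1] keeps everything after that final '='.
cid = canonical_href.rpartition("=")[-1]

# Alternative: parse the query string, which does not depend on
# cid being the last parameter in the URL.
cid_from_query = parse_qs(urlparse(canonical_href).query)["cid"][0]

print(cid, cid_from_query)
```

The rpartition approach is shorter but relies on the ID being the last thing in the URL; parsing the query string is a little more robust if that ever changes.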

Use this ID to request the news data:

newsdata = urllib2.urlopen('https://www.google.com/finance/kd?output=json&keydevs=1&recnews=0&cid=' + cid)
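If you would rather not build the URL by string concatenation, the same query string can be assembled with urlencode. This is only a sketch for Python 3: build_news_url is a hypothetical helper name, the 13606 ID is a made-up example, and the endpoint and parameters are copied from the line above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/finance/kd"  # endpoint taken from the answer

def build_news_url(cid):
    """Return the news-data URL for a content ID (hypothetical helper)."""
    # A list of pairs keeps the parameters in the same order as the answer.
    params = [("output", "json"), ("keydevs", 1), ("recnews", 0), ("cid", cid)]
    return BASE_URL + "?" + urlencode(params)

print(build_news_url("13606"))
```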

Unfortunately, the data returned is not valid JSON: the keys are unquoted. It is valid ECMAScript, just not valid JSON.

You can either repair this with a regular expression or use a lenient parser that accepts ECMAScript object notation.

The latter can be done with the external demjson library:

>>> import demjson
>>> import requests
>>> r = requests.get('https://www.google.com/finance/kd?output=json&keydevs=1&recnews=0&cid=' + cid)
>>> data = demjson.decode(r.content)
>>> data.keys()
[u'clusters', u'result_total_articles', u'results_per_page', u'result_end_num', u'result_start_num']
>>> data['clusters'][0]['a'][0]['t']
u'Former Ford Motor Co. CEO joins Google board'
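Once the response is decoded, the titles can be collected by walking the nested structure. The sketch below (Python 3) uses a hypothetical sample dict shaped like the output above; only the clusters, a and t keys come from that output, and article_titles is an illustrative helper name.

```python
# Hypothetical sample shaped like the decoded response shown above;
# only the 'clusters', 'a' and 't' keys come from the answer.
sample = {
    "clusters": [
        {"a": [{"t": "Former Ford Motor Co. CEO joins Google board"}]},
        {},  # a cluster with no articles, to exercise the fallback
    ]
}

def article_titles(data):
    """Collect every article title, skipping clusters without an 'a' list."""
    titles = []
    for cluster in data.get("clusters", []):
        for article in cluster.get("a", []):
            if "t" in article:
                titles.append(article["t"])
    return titles

print(article_titles(sample))
```

The output above does not show which field holds each article's URL, so inspect the keys of one article dict in the real response to find it.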

Repairing with a regular expression:

import re
import json

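# broken_data is the raw response text, e.g. newsdata.read()
# Wrap each bare key that follows '{' or ',' in double quotes
repaired_data = re.sub(r'(?<={|,)\s*(\w+)(?=:)', r'"\1"', broken_data)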
data = json.loads(repaired_data)
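Here is the regular expression applied to a small payload, as a sketch in Python 3. Both input strings are made up for illustration, and repair is a hypothetical helper name. The second string shows a limitation: the pattern cannot tell keys from text inside string values, so a value containing something like ", word:" gets rewritten and the result no longer parses.

```python
import json
import re

KEY_PATTERN = r'(?<={|,)\s*(\w+)(?=:)'

def repair(text):
    """Quote bare object keys that directly follow '{' or ','."""
    return re.sub(KEY_PATTERN, r'"\1"', text)

# Hypothetical payload with unquoted keys, for illustration only.
broken = '{clusters:[{a:[{t:"Former Ford Motor Co. CEO joins Google board"}]}],results_per_page:10}'
data = json.loads(repair(broken))
print(data["clusters"][0]["a"][0]["t"])

# Limitation: text inside a string that looks like ", word:" is rewritten too.
tricky = '{t:"Ford, said: hello"}'
try:
    json.loads(repair(tricky))
    tricky_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    tricky_ok = False
print(tricky_ok)
```

If the real data contains headlines with that shape, the lenient-parser route is probably the safer choice.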