farshidbalan - 8 months ago
JSON Question

Loading more links in a page after sending json requests in Python

I am parsing this URL to get links from one of the boxes with infinite scroll. Here is my code for sending the request to the website to get the next 10 links:

import requests
from bs4 import BeautifulSoup
import urllib2
import urllib
import extraction
import json
from json2html import *

baseUrl = 'http://www.marketwatch.com/news/headline/getheadlines'
parameters2 = {
    'topic': ' ',
    # (the remaining parameters were cut off in the original post)
}
html2 = requests.get(baseUrl, params=parameters2)
html3 = json.loads(html2.text)  # array of size 10

In the corresponding HTML, there is an element like:

<li class="loading">Loading more headlines...</li>

which indicates that more items will be loaded on scrolling down, but I don't know how to use the JSON response to write a loop that gets more links.
My first try was to use Beautiful Soup and write the following code to get the links and ids:

url = 'http://www.marketwatch.com/investing/stock/xom'
r = urllib.urlopen(url).read()
soup = BeautifulSoup(r, 'lxml')
pressReleaseBox = soup.find('div', attrs={'id':'prheadlines'})

and then, while there are more links to scrape, get the next JSON file:

loadingMore = pressReleaseBox.find('li', attrs={'class': 'loading'})
while loadingMore is not None:
    # get the links from json file and load more links
    break  # placeholder until the loop body is implemented

I don't know how to implement the commented part. Do you have any idea how to do it?
I am not obliged to use BeautifulSoup, and any other working library will be fine.

Answer

Here is how you can load the next JSON file:

  1. Take the last JSON file you received and extract the value of the key UniqueId from its last item.
    1. If the value looks like e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8499 (two parts separated by a colon):
      1. extract e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2 as sequence
      2. extract 8499 as messageNumber
      3. let docId be empty
    2. If the value looks like 1222712881 (a single number with no colon):
      1. let sequence be empty
      2. let messageNumber be empty
      3. extract 1222712881 as docId
  2. Put the parameters sequence, messageNumber and docId into your parameters2.
  3. Use requests.get(baseUrl, params = parameters2) to get your next JSON file.
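
The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the response is taken to be a JSON array whose items each carry a UniqueId key (as described in the question and the steps), and an empty array is assumed to mean there is nothing more to load. The function names split_unique_id and fetch_headlines, the max_pages cap and the get_json hook are all made up for this example and are not part of the site's API.

```python
def split_unique_id(unique_id):
    """Turn a UniqueId value into the three paging parameters."""
    unique_id = str(unique_id)
    if ':' in unique_id:
        # Form "<sequence>:<messageNumber>"
        sequence, message_number = unique_id.rsplit(':', 1)
        return {'sequence': sequence,
                'messageNumber': message_number,
                'docId': ''}
    # No colon: treat the whole value as the docId
    return {'sequence': '', 'messageNumber': '', 'docId': unique_id}


def fetch_headlines(base_url, params, max_pages=5, get_json=None):
    """Request up to max_pages pages and return all items in one list.

    get_json is a hypothetical hook (url, params) -> parsed JSON, so the
    loop can be tried without network access; by default it uses requests.
    """
    if get_json is None:
        import requests  # imported here so split_unique_id works without it

        def get_json(url, query):
            return requests.get(url, params=query).json()

    params = dict(params)  # work on a copy, leave the caller's dict alone
    items = []
    for _ in range(max_pages):
        page = get_json(base_url, dict(params))
        if not page:
            # Assumption: an empty response means no more headlines
            break
        items.extend(page)
        # Steps 1 and 2: derive the next parameters from the last item
        params.update(split_unique_id(page[-1]['UniqueId']))
    return items
```

With the names from the question, a call could look like fetch_headlines(baseUrl, parameters2, max_pages=3). The cap on the number of pages is there so the loop cannot run forever if the server keeps returning items.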