user3725021 - 11 days ago
JSON Question

Parse json file from website with python

I am looking to parse and save the contents of a JSON file which is embedded in the HTML code. However, when I isolate the relevant string and try to load it with the json package, I receive the error JSONDecodeError: Extra data, and I am unsure what is causing this.

It was suggested that the relevant code actually could contain multiple dictionaries and this might be problematic, but I'm not clear on how to proceed if this is true. My code is provided below. Any suggestions much appreciated!

from bs4 import BeautifulSoup
import urllib.request
from urllib.request import HTTPError
import csv
import json
import re

def left(s, amount):
    return s[:amount]

def right(s, amount):
    return s[-amount:]

def mid(s, offset, amount):
    return s[offset:offset+amount]

url= "url"
from urllib.request import Request, urlopen
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    s = urlopen(req,timeout=200).read()
except urllib.request.HTTPError as e:
    print(str(e))
soup = BeautifulSoup(s, "lxml")
tables=soup.find_all("script")
for i in range(0,len(tables)):
    if str(tables[i]).find("TimeLine.init")>-1:
        dat=str(tables[i]).splitlines()
        for tbl in dat:
            if str(tbl).find("TimeLine.init")>-1:
                s=str(tbl).strip()
                j=json.loads(s)

Answer

You could use the json module's own exception reporting to help with the parsing. The error message gives the position at which loads() failed, for example:

Extra data: line 1 column 1977 (char 1976)
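As a small self-contained illustration of that error report (separate from the full script described next), the sketch below triggers the same "Extra data" error on a made-up string. The input text is an assumption chosen to resemble two dictionaries passed as arguments to a JavaScript call; it is not taken from the page in the question. It also shows that the exception carries the failing position as an attribute, so the message does not have to be parsed to get it.

```python
import json

# Hypothetical input: two JSON objects separated by a comma, similar in
# shape to the arguments of a JavaScript call such as init({...}, {...}).
text = '{"a": 1},{"b": 2}'

try:
    json.loads(text)
except json.JSONDecodeError as e:
    # msg, pos, lineno and colno are attributes of JSONDecodeError;
    # pos is the same number that appears after "char" in the message.
    report = (e.msg, e.pos, e.lineno, e.colno)
    # Everything before the failing position is one complete JSON value.
    first = json.loads(text[:e.pos])

print(report)  # ('Extra data', 8, 1, 9)
print(first)   # {'a': 1}
```

Here the decoder reads the first object, then stops at the comma (character 8) because loads() expects exactly one JSON value per string.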

The full script below first locates all the JavaScript <script> tags and looks for the function call inside each. It then finds the outer start and end of the JSON text. Next it attempts to decode that text, notes the failing offset, skips that character and tries again. When the final block is reached, it decodes successfully. The script then calls loads() on each valid block, storing the results in json_decoded:

from bs4 import BeautifulSoup
from urllib.request import HTTPError, Request, urlopen
import csv
import json
import re


url = "http://live-footy.heraldsun.com.au/FieldView/Index/20120120120140101"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

try:
    s = urlopen(req, timeout=200).read()
except HTTPError as e:
    # HTTPError is imported by name above; the bare name "urllib" is not bound here
    print(str(e))

json_decoded = []
soup = BeautifulSoup(s, "lxml")

for script in soup.find_all("script", attrs={"type" : "text/javascript"}):
    text = script.text
    search = 'FieldView.TimeLine.init('
    field_start = text.find(search)

    if field_start != -1:
        # Find the start and end of the JSON in the function
        json_offsets = []
        json_start = field_start + len(search)
        json_end = text.rfind('}', 0, text.find(');', json_start)) + 1

        # Extract JSON
        json_text = text[json_start : json_end]

        # Attempt to decode, and record the offsets of where the decode fails
        offset = 0

        while True:
            try:
                dat = json.loads(json_text[offset:])
                break
            except json.decoder.JSONDecodeError as e:
                # Extract failed location from the exception report
                failed_at = int(re.search(r'char\s*(\d+)', str(e)).group(1))
                offset = offset + failed_at + 1
                json_offsets.append(offset)

        # Extract each valid block and decode it to a list
        cur_offset = 0

        for offset in json_offsets:
            json_block = json_text[cur_offset : offset - 1]
            json_decoded.append(json.loads(json_block))
            cur_offset = offset

print(json_decoded)

This results in json_decoded holding two JSON entries.
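A possible alternative, sketched here under assumptions rather than taken from the script above, is json.JSONDecoder.raw_decode(), which decodes one value starting at a given index and returns both the value and the index where it stopped. That avoids retrying loads() and reading offsets from the error. The helper name decode_all and the sample string are hypothetical, and the sketch assumes the values are separated only by commas and whitespace.

```python
import json


def decode_all(text):
    """Decode consecutive JSON values separated by commas and/or whitespace."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        # Skip separators between values; raw_decode needs to start
        # exactly at the first character of a JSON value.
        if text[pos] in ', \t\r\n':
            pos += 1
            continue
        # raw_decode returns the decoded value and the index just past it
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


# Made-up sample resembling the argument list of a JavaScript call
sample = '{"a": 1}, {"b": [2, 3]}, "tail"'
print(decode_all(sample))  # [{'a': 1}, {'b': [2, 3]}, 'tail']
```

In the script above, this would be called on json_text. Note that the helper collects every value it finds, including the last one, and raises JSONDecodeError if it meets text that is not valid JSON.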
