johan855 - 6 days ago

Question

JSON Line issue when loading from import.io using Python

I'm having a hard time trying to load an API response from import.io into a file or a list.

The endpoint I'm using is:

https://data.import.io/extractor/{0}/json/latest?_apikey={1}


Previously all my scripts used plain JSON and everything worked well, but they have now switched to JSON Lines, and the response somehow seems malformed.
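For context, a well-formed JSON Lines payload carries one complete, independent JSON document per line, so each line parses on its own (the payload below is illustrative, not the actual API response):

```python
import json

# Illustrative JSON Lines payload: one complete JSON document per line
jsonl = '{"page": 1}\n{"page": 2}\n'
data = [json.loads(line) for line in jsonl.splitlines()]
print(data)  # [{'page': 1}, {'page': 2}]
```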

The way I tried to adapt my scripts is to read the API response in the following way:

import json
import requests

url_call = 'https://data.import.io/extractor/{0}/json/latest?_apikey={1}'.format(extractors_row_dict['id'], auth_key)
r = requests.get(url_call)

with open(temporary_json_file_path, 'w') as outfile:
    json.dump(r.content, outfile)

data = []
with open(temporary_json_file_path) as f:
    for line in f:
        data.append(json.loads(line))


The problem is that data[0] ends up containing the entire file content as a single string, while data[1] raises:

IndexError: list index out of range


Here is an example of data[0][:300]:

u'{"url":"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de","result":{"extractorData":{"url":"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de","resourceId":"23455234","data":[{"group":[{"Brand":[{"text":"Brand","href":"https://www.example.com'


Does anyone have experience with this API's responses?
All the other JSON Lines sources I read work fine; only this one fails.

EDIT based on comment:

print repr(open(temporary_json_file_path).read(300))


gives this:

'"{\\"url\\":\\"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de\\",\\"result\\":{\\"extractorData\\":{\\"url\\":\\"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de\\",\\"resourceId\\":\\"df8de15cede2e96fce5fe7e77180e848\\",\\"data\\":[{\\"group\\":[{\\"Brand\\":[{\\"text\\":\\"Bra'
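One quick way to tell whether a file holds real JSON Lines or a JSON-encoded string is to decode a line once and check the result's type (a hypothetical diagnostic with made-up sample data, not part of the original script):

```python
import json

# Two synthetic lines: a plain JSON object vs. the same text encoded twice
raw_line = '{"url": "https://www.example.com"}'
double_encoded_line = json.dumps(raw_line)  # wraps the JSON text in a JSON string

print(type(json.loads(raw_line)).__name__)             # dict -> normal JSON
print(type(json.loads(double_encoded_line)).__name__)  # str  -> double-encoded
```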

Answer

The API gave you double-encoded data: something pushed JSON into the service, and the service then encoded that data again into a JSON string. You can see this in the repr() output above: the line begins with a literal " and every inner quote is escaped, which is the signature of a JSON string rather than a JSON object.

You'd have to decode it a second time:

import json

data = []
with open(temporary_json_file_path) as f:
    for line in f:
        decoded = json.loads(line)
        # attempt to decode again; if it fails there was no double encoding
        try:
            decoded = json.loads(decoded)
        except TypeError:
            pass
        data.append(decoded)
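As a self-contained sketch, here is the same decode-twice loop applied to synthetic double-encoded lines (the sample objects are fabricated to mimic the problematic response):

```python
import json

# Fabricated double-encoded JSON Lines, mimicking the problematic response
lines = [json.dumps(json.dumps({"url": "https://www.example.com", "page": n}))
         for n in (1, 2)]

data = []
for line in lines:
    decoded = json.loads(line)         # first pass yields a JSON *string*
    try:
        decoded = json.loads(decoded)  # second pass yields the actual object
    except TypeError:
        pass                           # already a dict/list: not double-encoded
    data.append(decoded)

print(data[0]["page"], data[1]["page"])  # 1 2
```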

It would be much better if this were fixed at the source, however. I'm not sure how import.io extractors are built; this could be a bug in their code, or in whatever is responsible for scraping the sites.
