seanv507 seanv507 - 2 months ago 47
JSON Question

pandas - reading multiple JSON records into dataframe

I'd like to know if there is a memory efficient way of reading multi record JSON file ( each line is a JSON dict) into a pandas dataframe. Below is a 2 line example with working solution, I need it for potentially very large number of records. Example use would be to process output from Hadoop Pig JSonStorage function.

import json
import pandas as pd

test='''{"a":1,"b":2}
{"a":3,"b":4}'''
#df=pd.read_json(test,orient='records') doesn't work, expects []

l=[ json.loads(l) for l in test.splitlines()]
df=pd.DataFrame(l)

Answer

It's going to depend on the size of you DataFrames which is faster, but another option is to use join to smash your multi line "JSON" (Note: it's not valid json), into valid json and use read_json:

In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'

For this tiny example this is slower, if around 100 it's the similar, signicant gains if it's larger...

In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 µs per loop

In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 µs per loop

In [23]: test_100 = '\n'.join([test] * 100)

In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop

In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop

In [26]: test_1000 = '\n'.join([test] * 1000)

In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop

In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop

Note: of that time the join is surprisingly fast.