measureallthethings - 10 months ago
JSON Question

Fastest Method To Read Thousands of JSON Files in Python

I have a number of JSON files I need to analyze. I am using IPython (Python 3.5.2 | IPython 5.0.0), reading each file into a dictionary and appending each dictionary to a list.

My main bottleneck is reading in the files. Some files are smaller, and are read quickly, but the larger files are slowing me down.

Here is some example code (sorry, I cannot provide the actual data files):

import json
import glob

def read_json_files(path_to_file):
    with open(path_to_file) as p:
        data = json.load(p)
        p.close()
    return data

def giant_list(json_files):
    data_list = []
    for f in json_files:
        data_list.append(read_json_files(f))
    return data_list

support_files = glob.glob('/Users/path/to/support_tickets_*.json')
small_file_test = giant_list(support_files)

event_files = glob.glob('/Users/path/to/google_analytics_data_*.json')
large_file_test = giant_list(event_files)

The support tickets are very small in size--largest I've seen is 6KB. So, this code runs pretty fast:

In [3]: len(support_files)
Out[3]: 5278

In [5]: %timeit giant_list(support_files)
1 loop, best of 3: 557 ms per loop

But larger files definitely are slowing me down...these event files can reach ~2.5MB each:

In [7]: len(event_files) # there will be a lot more of these soon :-/
Out[7]: 397

In [8]: %timeit giant_list(event_files)
1 loop, best of 3: 14.2 s per loop

I've researched how to speed up the process and came across this post. However, when I used UltraJSON, the timing was slightly worse:

In [3]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop

SimpleJSON did not do much better:

In [4]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop

Any tips on how to optimize this code and read a lot of JSON files into Python more efficiently are much appreciated.

Finally, this post is the closest I've found to my question, but it deals with one giant JSON file, not many smaller ones.

Answer

Use a list comprehension to build the list. It avoids the overhead of calling append on every iteration of an explicit loop.

def giant_list(json_files):
    return [read_json_file(path) for path in json_files]

You are closing the file object twice; do it only once. The with block closes the file automatically on exit, so the explicit close() call is redundant.

def read_json_file(path_to_file):
    with open(path_to_file) as p:
        return json.load(p)
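
To see that the with block really does close the file on its own, it is enough to inspect the file object's closed attribute. This is only a minimal sketch: the temporary file and its sample payload are made up for illustration and stand in for one of the real JSON files.

```python
import json
import os
import tempfile

# Write a small throwaway JSON file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"ticket": 1}, tmp)
    tmp_path = tmp.name

with open(tmp_path) as p:
    data = json.load(p)
    closed_inside = p.closed  # the file is still open inside the block

closed_after = p.closed       # the file is closed once the block exits

os.remove(tmp_path)
print(closed_inside, closed_after)
```

Because the file is already closed when the block exits, calling p.close() as well is harmless but adds nothing.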

At the end of the day, your problem is I/O bound, but these changes will help a little. Also, I have to ask: do you really need to have all these dictionaries in memory at the same time?
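
If the dictionaries do not all need to be in memory at once, a generator is one possible alternative to building a list. The sketch below assumes each file can be processed independently. The name iter_json_files, the "hits" key and the throwaway files are hypothetical, chosen only to make the example runnable.

```python
import json
import os
import shutil
import tempfile

def iter_json_files(json_files):
    """Yield one parsed document at a time instead of building a full list."""
    for path in json_files:
        with open(path) as p:
            yield json.load(p)

# Demo with throwaway files (hypothetical data, not the real event files).
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmp_dir, "events_%d.json" % i)
    with open(path, "w") as fh:
        json.dump({"hits": i + 1}, fh)
    paths.append(path)

# Aggregate while streaming: each document can be discarded after it is used,
# so the full set of dictionaries is never held in memory together.
total_hits = sum(doc["hits"] for doc in iter_json_files(paths))
print(total_hits)

shutil.rmtree(tmp_dir)  # clean up the demo files
```

This does not make any single file load faster. It only keeps peak memory roughly at the size of one parsed file, which may matter more as the number of event files grows.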