Agape Gal'lo - 1 month ago
JSON Question

MemoryError when loading a JSON file

Python (and Spyder) raise a MemoryError when I load a JSON file that is 500 MB large.

But my computer has 32 GB of RAM, and the "memory" indicator displayed by Spyder only goes from 15% to 19% when I try to load it! It seems that I should have much more space...

Is there something I didn't think of?

Answer

500 MB of JSON data on disk does not result in 500 MB of memory usage; it results in a multiple of that. Exactly what factor depends on the data, but a factor of 10 to 25 is not uncommon.

For example, the following simple JSON string of 14 characters (bytes on disk) results in a Python object that is almost 25 times larger (Python 3.6b3):

>>> import json
>>> from sys import getsizeof
>>> j = '{"foo": "bar"}'
>>> len(j)
14
>>> p = json.loads(j)
>>> getsizeof(p) + sum(getsizeof(k) + getsizeof(v) for k, v in p.items())
344
>>> 344 / 14
24.571428571428573

That's because Python objects carry some overhead: every instance tracks the number of references to it, its type, and its attributes (if the type supports attributes) or its contents (in the case of containers).
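Note that `getsizeof()` is shallow: it counts only the container itself, not what it holds. The measurement above walks one level of a dict by hand; a hedged sketch of a more general recursive version (the `total_size` name is mine, not a standard function) looks like this:

```python
import json
from sys import getsizeof


def total_size(obj, seen=None):
    """Roughly sum getsizeof over an object and its contents.

    getsizeof alone is shallow, so container contents must be
    walked explicitly; a `seen` set avoids double-counting
    objects that are referenced more than once.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size


doc = '[{"foo": "bar"}, {"foo": "baz"}]'
parsed = json.loads(doc)
print(total_size(parsed), "bytes in memory for", len(doc), "bytes of JSON")
```

Running this on your own data gives a rough estimate of the blow-up factor to expect before attempting a full load.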

If you are using the built-in json library to load that file, it has to build larger and larger objects from the contents as they are parsed, and at some point your OS will refuse to provide more memory. That won't happen at 32 GB, because there is a per-process limit on how much memory can be used, so it is more likely to happen at 4 GB. At that point all the objects already created are freed again, so in the end the actual memory use need not have changed much.

The solution is to either break that large JSON file up into smaller subsets, or to use an event-driven JSON parser such as ijson.

An event-driven JSON parser doesn't create Python objects for the whole file, only for the currently parsed item, and it notifies your code of each item with an event ('starting an array', 'here is a string', 'now starting a mapping', 'this is the end of the mapping', etc.). You can then decide which data you need to keep and which to ignore. Anything you ignore is discarded again, so memory use stays low.