user753720 - 8 months ago
JSON Question

How to load directory of JSON files into Apache Spark in Python

I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are saved in multiple JSON files (each file is gzipped and contains a list of dictionaries). The resulting RDD would then, roughly speaking, contain all of the lists of dictionaries combined into a single list of dictionaries. I haven't been able to find this in the documentation, but if I missed it please let me know.

So far I have tried reading the JSON files and building the combined list in Python, then calling sc.parallelize(). However, the entire dataset is too large to fit in memory, so this is not a practical solution. It seems like Spark would have a smart way of handling this use case, but I'm not aware of it.

How can I create a single RDD in Python comprising the lists in all of the JSON files?

I should also mention that I do not want to use Spark SQL. I'd like to use functions like map, filter, etc., if that's possible.


Following what tgpfeiffer mentioned in their answer and comment, here's what I did.

First, as they mentioned, the JSON files had to be formatted so they had one dictionary per line rather than a single list of dictionaries. Then, it was as simple as:

import json
my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)
my_RDD_dictionaries = my_RDD_strings.map(json.loads)  # assumed completion: parse each line as JSON
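
For what it's worth, the step that turns each line of text into a dictionary could be sketched roughly like this. The helper name parse_json_line is hypothetical (it is not from the answer I followed), and skipping blank or malformed lines is my own assumption about what is desirable. It returns a list so it can be passed to flatMap, which drops the empty results.

```python
import json

def parse_json_line(line):
    """Return [dict] for a line holding one JSON object, or [] if the line is unusable."""
    line = line.strip()
    if not line:
        return []
    try:
        record = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    # Keep only JSON objects, since the RDD is meant to hold dictionaries
    return [record] if isinstance(record, dict) else []

# With a live SparkContext `sc` (not created here), the usage would be roughly:
# my_RDD_dictionaries = sc.textFile(path_to_dir_with_JSON_files).flatMap(parse_json_line)
```

Once the RDD holds dictionaries, the usual map and filter calls work on it without Spark SQL.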

If there's a better or more efficient way to do this, please let me know, but this seems to work.
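
In case it helps anyone, here is a minimal sketch of the reformatting step, assuming each input file is a gzipped JSON file containing one list of dictionaries. The function name list_json_to_json_lines and the file paths are hypothetical, and each file is loaded into memory one at a time, so this only works if a single file fits in memory.

```python
import gzip
import json

def list_json_to_json_lines(src_path, dst_path):
    """Rewrite a gzipped JSON list of dicts as gzipped text with one dict per line."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        records = json.load(src)  # loads this one file's list into memory
    with gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line
            dst.write(json.dumps(record) + "\n")
    return len(records)
```

Running this over every file in the directory gives output that sc.textFile can read line by line, since Spark decompresses .gz files transparently.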