I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are saved in multiple JSON files (each is gzipped and contains a list of dictionaries). The resulting RDD would then, roughly speaking, contain all of the lists of dictionaries combined into a single list of dictionaries. I haven't been able to find this in the documentation (https://spark.apache.org/docs/1.2.0/api/python/pyspark.html), but if I missed it please let me know.
So far I tried reading the JSON files and building the combined list in plain Python, then calling sc.parallelize() on it; however, the entire dataset is too large to fit in memory, so this is not a practical solution. It seems like Spark would have a smart way of handling this use case, but I'm not aware of it.
How can I create a single RDD in Python comprising the lists in all of the JSON files?
I should also mention that I do not want to use Spark SQL. I'd like to use functions like map, filter, etc., if that's possible.
Following what tgpfeiffer mentioned in their answer and comment, here's what I did.
First, as they mentioned, the JSON files had to be formatted so they had one dictionary per line rather than a single list of dictionaries. Then, it was as simple as:
import json

my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)
my_RDD_dictionaries = my_RDD_strings.map(json.loads)
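For anyone with files in the original shape, the one-dictionary-per-line reformatting can be done with a short standalone script before handing the directory to Spark. A minimal sketch (the function name and paths are mine, not anything from Spark), assuming each source file is a gzipped JSON list of dictionaries:

```python
import gzip
import json

def convert_to_json_lines(src_path, dst_path):
    """Rewrite a gzipped file holding one JSON list of dicts as a
    gzipped file with one JSON object per line (JSON Lines)."""
    with gzip.open(src_path, "rt") as src:
        records = json.load(src)  # the whole list of dictionaries
    with gzip.open(dst_path, "wt") as dst:
        for record in records:
            dst.write(json.dumps(record) + "\n")
```

sc.textFile() decompresses gzipped input transparently, so the converted files can be read as-is; note that each gzipped file ends up as a single non-splittable partition.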
If there's a better or more efficient way to do this, please let me know, but this seems to work.
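One possible alternative, if reformatting the files is inconvenient: sc.wholeTextFiles() yields (filename, contents) pairs, so each file's list could be parsed whole and flattened with flatMap. I haven't verified that wholeTextFiles decompresses gzip the way textFile does, so treat the Spark lines below (shown as comments) as a sketch; the runnable part is a plain-Python equivalent of the flatMap step, using stand-in strings for file contents:

```python
import json
from itertools import chain

# Hypothetical Spark version (unverified, especially gzip handling):
#   pairs = sc.wholeTextFiles(path_to_dir_with_JSON_files)  # (filename, contents)
#   my_RDD_dictionaries = pairs.flatMap(lambda kv: json.loads(kv[1]))

# What flatMap does here, in plain Python: parse each file's contents
# into a list of dicts, then flatten all the lists into one.
file_contents = ['[{"a": 1}, {"a": 2}]', '[{"b": 3}]']
combined = list(chain.from_iterable(json.loads(s) for s in file_contents))
```

Note that wholeTextFiles loads each file's entire contents as one record, so it only suits files that individually fit in an executor's memory.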