If I save data frame this way in Java, ...:
load_df = sqlContext.read.parquet("myTest.parquet").where('field1="aaa"').select('field2', 'field3').coalesce(64)
Spark writes your data into a directory rather than a single file; that directory contains numerous part files, as you say, and if the write operation succeeds it also creates an empty marker file called _SUCCESS.
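For illustration, the contents of a successful write typically look something like this (the exact part file names vary):

myTest.parquet/
    _SUCCESS                         (empty marker file, written only on success)
    part-00000-...snappy.parquet     (one part file per partition)
    part-00001-...snappy.parquet
    ...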
I'm coming from Scala, but I believe there is a similar way to do this in Python.
Saving and reading your files in Parquet, JSON, or whatever format you want is straightforward:
df.write.parquet("path")
loaddf = spark.read.parquet("path")
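For example, a complete round trip in PySpark could look like the sketch below; the SparkSession setup, the /tmp path, and the sample data are assumptions added for illustration, while the column names mirror your snippet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-roundtrip").getOrCreate()

# a tiny example DataFrame; field1/field2/field3 mirror the columns in your snippet
df = spark.createDataFrame(
    [("aaa", 1, "x"), ("bbb", 2, "y")],
    ["field1", "field2", "field3"],
)

# writing produces a directory of part files plus the empty _SUCCESS marker
df.write.mode("overwrite").parquet("/tmp/myTest.parquet")

# reading it back returns a DataFrame you can filter and project as usual
load_df = (
    spark.read.parquet("/tmp/myTest.parquet")
    .where('field1 = "aaa"')
    .select("field2", "field3")
)
load_df.show()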
I tried collect(), but it does not help.
As for collect(), it is not good practice to use it in such operations, because it returns all of your data to the driver: you lose the benefits of parallel computation, and it will cause an OutOfMemoryError if the data cannot fit in the driver's memory.
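As a rough sketch of the difference, reusing the df and spark from the example above:

# collect() pulls every row to the driver as a list of Row objects;
# on a large DataFrame this can exhaust driver memory (OutOfMemoryError)
rows = df.collect()

# write() lets each executor persist its own partition in parallel,
# so no data is funneled through the driver
df.write.mode("overwrite").parquet("/tmp/parallel_out.parquet")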
Is it possible to save data frame as a single file?
In most cases you really don't need to do that; if you do, call the repartition(1) method on your DataFrame before saving it.
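If you really do need a single part file, a sketch like the one below (the paths are placeholders) does it, at the cost of funneling the whole write through one task; coalesce(1) is an alternative that avoids a full shuffle:

# repartition(1) shuffles all data into one partition, so the output
# directory contains exactly one part file (plus the _SUCCESS marker)
df.repartition(1).write.mode("overwrite").parquet("/tmp/single_part.parquet")

# coalesce(1) also yields a single part file without a full shuffle,
# but the whole write still runs as one task
df.coalesce(1).write.mode("overwrite").parquet("/tmp/single_part.parquet")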
Hope it helps. Best regards.