
How to load Parquet files from a hadoopish folder

If I save a data frame this way in Java:

df.write().parquet("myTest.parquet");

then it gets saved in a "hadoopish" way, i.e. as a directory containing numerous part files.

Is it possible to save the data frame as a single file? I tried collect(), but it does not help.

If that's impossible, then my question is: how should I change the following Python code so that it reads Parquet files from the hadoopish directory created by df.write().parquet("myTest.parquet"):

load_df = sqlContext.read.parquet("myTest.parquet").where('field1="aaa"').select('field2', 'field3').coalesce(64)

Answer

Spark writes your output as a directory containing numerous files, as you say, and if the write operation succeeds it also saves an empty marker file called _SUCCESS in that directory.
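For illustration, the directory created by df.write().parquet("myTest.parquet") typically looks something like this (the exact part-file names vary by Spark version and number of partitions; <uuid> stands in for a generated identifier):

myTest.parquet/
    _SUCCESS
    part-00000-<uuid>.snappy.parquet
    part-00001-<uuid>.snappy.parquet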

I'm coming from Scala, but I believe there's an equivalent way in Python.

Saving and reading your files in Parquet, JSON, or whatever format you want is straightforward:

df.write.parquet("path")                 // writes a directory of part files
val loadDf = spark.read.parquet("path")  // reads the directory back as a DataFrame
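Since the question uses the Java API, here is a minimal Java sketch of the same round trip. It assumes a local run and a hypothetical input.json just so there is something to write; any existing Dataset<Row> works the same way:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetRoundTrip {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParquetRoundTrip")
                .master("local[*]") // assumption: local run, just for illustration
                .getOrCreate();

        // Hypothetical input so that df exists; replace with your own data.
        Dataset<Row> df = spark.read().json("input.json");

        // Writing produces the "hadoopish" directory of part files.
        df.write().parquet("myTest.parquet");

        // Reading it back: point at the directory, not at an individual part file.
        Dataset<Row> loadDf = spark.read().parquet("myTest.parquet");
        loadDf.show();

        spark.stop();
    }
}

The key point is that the path passed to read().parquet() is the directory itself; Spark discovers the part files inside it on its own.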

I tried collect(), but it does not help.

Talking about collect(): it is not good practice to use it in such operations, because it returns all your data to the driver, so you lose the benefits of parallel computation, and it will cause an OutOfMemoryError if the data doesn't fit in the driver's memory.
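To make that concrete, here is a small sketch using collectAsList(), the Java counterpart of collect(); it materializes every row on the driver and changes nothing about how the files are written:

import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CollectPitfall {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CollectPitfall")
                .master("local[*]") // assumption: local run, just for illustration
                .getOrCreate();

        Dataset<Row> df = spark.read().parquet("myTest.parquet");

        // collectAsList() ships every row to the driver JVM's heap:
        // fine for small results, an OutOfMemoryError waiting to happen
        // for large data sets.
        List<Row> rows = df.collectAsList();
        System.out.println("Rows materialized on the driver: " + rows.size());

        spark.stop();
    }
}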

Is it possible to save data frame as a single file?

In most cases you really don't need to do that. If you do, use the repartition(1) method on your DataFrame before saving it, as in the sketch below.
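A minimal Java sketch of that, assuming df is an existing Dataset<Row> and that overwriting a previous run is acceptable:

// repartition(1) shuffles everything into one partition; coalesce(1) also
// works and avoids the full shuffle, but either way the whole write is
// funneled through a single task.
df.repartition(1)
  .write()
  .mode("overwrite")
  .parquet("myTest.parquet");

Note that even then the output is still a directory (myTest.parquet/ containing a single part file plus _SUCCESS), not a bare file; Spark's writers always produce a directory.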

Hope it helps. Best regards.
