eleanora - 1 year ago
Python Question

How to copy and convert parquet files to csv

I have access to an HDFS filesystem and can see Parquet files with

hadoop fs -ls /user/foo


How can I copy those Parquet files to my local system and convert them to CSV so I can use them? The files should be simple text files with a number of fields per row.

Answer

Try

# `spark` is the SparkSession that the pyspark shell provides automatically
df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv")
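
If you are running a standalone script rather than the pyspark shell, you have to create the SparkSession yourself. A minimal self-contained sketch, assuming a standalone script (the app name and the header option are illustrative, not part of the original answer):

from pyspark.sql import SparkSession

# Build a SparkSession for a standalone script
spark = SparkSession.builder.appName("parquet-to-csv").getOrCreate()

df = spark.read.parquet("/path/to/infile.parquet")
# header=True writes a header row; note that Spark writes a directory
# of part files, not a single CSV file
df.write.csv("/path/to/outfile.csv", header=True)

spark.stop()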

Relevant API documentation: pyspark.sql.DataFrameReader.parquet and pyspark.sql.DataFrameWriter.csv.

Both /path/to/infile.parquet and /path/to/outfile.csv should be locations on the HDFS filesystem. You can specify the hdfs://... scheme explicitly, or you can omit it, since HDFS is usually the default scheme.
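
For example, with the scheme written out explicitly (the namenode host and port below are placeholders for your cluster's values):

# Fully qualified HDFS URIs; replace namenode:8020 with your cluster's namenode
df = spark.read.parquet("hdfs://namenode:8020/path/to/infile.parquet")
df.write.csv("hdfs://namenode:8020/path/to/outfile.csv")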

You should avoid using file://..., because a local path means a different file on every machine in the cluster. Write the output to HDFS instead, then transfer the results to your local disk from the command line:

hdfs dfs -get /path/to/outfile.csv /path/to/localfile.csv
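
Because df.write.csv writes a directory of part files rather than a single file, the -get above copies a directory. To concatenate the parts into one local file, you can use getmerge instead:

hdfs dfs -getmerge /path/to/outfile.csv /path/to/localfile.csv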

Or display it directly from HDFS (globbing over the part files, since the output path is a directory):

hdfs dfs -cat /path/to/outfile.csv/part-*