flybonzai - 1 year ago
Python Question

Spark writes out `saveAsTextFile` in a Row() format

I'm trying to copy these files over from S3 to Redshift, and they are all in the format of Row(column1=value, column2=value,...), which obviously causes issues. How do I get a dataframe to write out in normal csv?

I'm calling it like this:

# final_data.rdd.saveAsTextFile(
# path=r's3n://inst-analytics-staging-us-standard/spark/output',
# compressionCodecClass=''
# )
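For context, `saveAsTextFile` writes the string form of each RDD element, and for a DataFrame's RDD that string is the `Row(...)` representation. One possible workaround is to map each row to a CSV line before saving. The sketch below assumes Python 3 and relies on a PySpark `Row` behaving like a tuple; the helper name `row_to_csv_line` and the `output_path` variable are made up for illustration.

```python
import csv
import io


def row_to_csv_line(row):
    """Format one row (any iterable of values) as a single CSV line."""
    buf = io.StringIO()
    # No line terminator here: saveAsTextFile writes one element per line itself.
    csv.writer(buf, lineterminator="").writerow(row)
    return buf.getvalue()


# Hypothetical usage with the DataFrame from the question
# (output_path is a placeholder, not a real location):
# final_data.rdd.map(row_to_csv_line).saveAsTextFile(output_path)
```

Using the `csv` module rather than `','.join(...)` means values that contain commas or quotes get quoted properly, and `None` becomes an empty field.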

I've also tried writing out with the spark-csv module, and it seems like it ignores any of the computations I did, and just formats the original parquet file as a csv and dumps it out.

I'm calling that like this:


Answer

The spark-csv approach is a good one and should work. Judging from your code, you are calling df.write on the original DataFrame df, which is why your transformations are ignored. DataFrame transformations return a new DataFrame rather than modifying df in place, so you need to keep that result and write it out instead. Something like:

final_data = # Do your logic on df and return a new DataFrame
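A minimal sketch of the write step, assuming the spark-csv package is on the classpath. The helper name `write_csv`, its `path` argument, and the header option are illustrative choices, not something taken from the question. The point is only that `.write` is called on `final_data`, the transformed DataFrame, rather than on `df`.

```python
def write_csv(final_data, path):
    """Write the transformed DataFrame (not the original df) as CSV."""
    (final_data.write
        .format("com.databricks.spark.csv")  # spark-csv data source
        .option("header", "true")            # assumption: a header row is wanted
        .save(path))


# Hypothetical usage (output_path is a placeholder):
# write_csv(final_data, output_path)
```

On Spark 2.0 and later, CSV support is built in, so `final_data.write.csv(path, header=True)` should do the same job without the extra package.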