duckertito - 20 days ago
Scala Question

How to read and write a DataFrame in Spark

I need to save a DataFrame in CSV or Parquet format (as a single file) and then open it again. The amount of data will not exceed 60 MB, so a single file is a reasonable solution. This simple task is giving me a lot of headache... This is what I tried:

To read the file if it exists:

val df = sqlContext
  .read.parquet("s3n://bucket/myTest.parquet")
  .toDF("key", "value", "date", "qty")


To write the file:

df.write.parquet("s3n://bucket/myTest.parquet")


This does not work because:

1) write creates a folder myTest.parquet full of "hadoopish" part files that I then cannot read back with .read.parquet("s3n://bucket/myTest.parquet"). In fact, I don't care about having multiple hadoopish files, as long as I can later read them easily into a DataFrame. Is that possible?

2) I am always working with the same file myTest.parquet, which I update and overwrite in S3. It tells me that the file cannot be saved because it already exists.

So, can someone show me the right way to do this read/write loop? The file format doesn't matter to me (CSV, Parquet, or even the hadoopish files), as long as I can make the read and write loop work.

Answer

You can save your DataFrame with saveAsTable("TableName") and read it back with table("TableName"). The storage location is controlled by spark.sql.warehouse.dir, and you can overwrite an existing table with mode(SaveMode.Overwrite). You can read more in the official documentation.

In Java it would look like this:

SparkSession spark = ...
spark.conf().set("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables");
Dataset<Row> data = ...
data.write().mode(SaveMode.Overwrite).saveAsTable("TableName");

Now you can read the data back with:

spark.read().table("TableName");