I need to save a DataFrame in CSV or Parquet format (as a single file) and then open it again. The amount of data will not exceed 60 MB, so a single file is a reasonable solution. This simple task is giving me a lot of headaches... This is what I tried:
To read the file if it exists:

```scala
val df = sqlContext
  .read.parquet("s3n://bucket/myTest.parquet")
  .toDF("key", "value", "date", "qty")
```

To write it:

```scala
df.write.parquet("s3n://bucket/myTest.parquet")
```
The problems are:

- `write` does not produce a single file but a directory named `myTest.parquet` containing many part files.
- `.read.parquet("s3n://bucket/myTest.parquet")` fails if `myTest.parquet` does not exist yet.
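For the single-file requirement specifically, here is a minimal sketch (my addition, not from the original post): coalescing to one partition before writing means the output directory holds exactly one part file. Note that Spark still writes a directory named `myTest.parquet`, just with a single `part-*.parquet` inside it.

```scala
import org.apache.spark.sql.SaveMode

// Coalesce to a single partition so only one part file is written.
// The result is still a directory, but it holds just one part-*.parquet.
df.coalesce(1)
  .write
  .mode(SaveMode.Overwrite) // replace any previous output instead of failing
  .parquet("s3n://bucket/myTest.parquet")
```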
You can save your DataFrame with `saveAsTable("TableName")` and read it back with `table("TableName")`. The storage location can be set via `spark.sql.warehouse.dir`. You can overwrite existing data with `mode(SaveMode.Overwrite)` (by contrast, `SaveMode.Ignore` silently skips the write if the table already exists). You can read more in the official documentation.
In Java it would look like this:

```java
SparkSession spark = ...
spark.conf().set("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables");

Dataset<Row> data = ...
data.write().mode(SaveMode.Overwrite).saveAsTable("TableName");
```
Now you can read the data back with:

```java
spark.read().table("TableName");
```
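Since the question's code is in Scala, here is a rough Scala equivalent of the same save/read flow. This is my sketch, not part of the original answer; the app name and HDFS path are placeholder assumptions, and `df` is the DataFrame from the question.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumed setup; adjust the app name and warehouse location as needed.
val spark = SparkSession.builder()
  .appName("SaveAndReadTable")
  .config("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables")
  .getOrCreate()

// Save the DataFrame as a managed table, replacing any existing one.
df.write.mode(SaveMode.Overwrite).saveAsTable("TableName")

// Read it back as a DataFrame.
val restored = spark.read.table("TableName")
```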