dreddy - 1 month ago
Scala Question

How to convert a DataFrame to an RDD in Scala without losing the DataFrame's schema

My dataframe is as follows:

storeId| dateId|projectId
9 |2457583| 1047
9 |2457576| 1048


When I do
rd = resultDataframe.rdd
rd only has the data and not the header information. I confirmed this with rd.first, which returns a data row rather than the header. Also, when I try

rd.map(f => f._1+"\t"+f._2+"\t"+f._3).saveAsTextFile("s3://pathinS3/testtab4")


I only see

9 2457583 1047
9 2457576 1048


I would like to convert resultDataframe into a tab-separated CSV file and store it in S3.

Expected CSV output in S3:

storeId dateId projectId
9 2457583 1047
9 2457576 1048


Any help is appreciated. Thanks in advance.

Answer

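An RDD holds only the rows, not the column names, so the header has to be added as a separate line. You can do it like this: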

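val rdd = df.rdd
// Each Row becomes one tab-separated line
val data = rdd.map(_.mkString("\t"))
// One-element RDD holding the column names as the header line
val header = sc.parallelize(Seq(df.columns.mkString("\t")))
// Put the header in front of the data
val rddWithHeader = header.union(data)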
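
To actually get a file out of this, the combined RDD still has to be saved. Below is a minimal sketch of the whole flow, assuming Spark 2.0 or later (for SparkSession and the built-in CSV writer). The object name TsvWithHeader, the helper tsvLine, the local master setting, and the /tmp/testtab4 default path are all made up for illustration; swap in your own DataFrame and S3 path. It shows both the RDD approach from above and, as an alternative, the DataFrame writer with its "header" and "sep" options.

```scala
import org.apache.spark.sql.SparkSession

object TsvWithHeader {
  // Hypothetical helper: joins any sequence of fields with tabs.
  def tsvLine(fields: Seq[Any]): String = fields.mkString("\t")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; on a cluster a session usually exists already.
    val spark = SparkSession.builder()
      .appName("tsv-with-header")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for resultDataframe, built from the sample rows in the question.
    val df = Seq((9, 2457583, 1047), (9, 2457576, 1048))
      .toDF("storeId", "dateId", "projectId")

    // Assumed output location; pass the real S3 path as the first argument.
    val outputPath = args.headOption.getOrElse("/tmp/testtab4")

    // Option 1: RDD approach. The header is a one-element, one-partition RDD
    // placed in front of the data by union.
    val header = spark.sparkContext.parallelize(Seq(tsvLine(df.columns.toSeq)), numSlices = 1)
    val data = df.rdd.map(row => tsvLine(row.toSeq))
    // coalesce(1) collapses the result into a single part file; reasonable for
    // small outputs, but it removes write parallelism.
    header.union(data).coalesce(1).saveAsTextFile(outputPath + "-rdd")

    // Option 2: let the CSV writer emit the header. Every part file gets its
    // own header line, so coalesce(1) is used here to end up with one file.
    df.coalesce(1)
      .write
      .option("sep", "\t")
      .option("header", "true")
      .csv(outputPath + "-csv")

    spark.stop()
  }
}
```

Both writes fail if the target directory already exists, so pick a fresh path or remove the old output first. With the RDD approach the header normally ends up as the first line because union keeps the header partition ahead of the data, but I would not treat that ordering as a hard guarantee on every input source; if the header position matters, the writer option is the safer choice.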