dreddy - 1 year ago
Scala Question

How to convert a DataFrame to an RDD in Scala without losing the DataFrame's schema

My dataframe is as follows:

storeId | dateId  | projectId
9       | 2457583 | 1047
9       | 2457576 | 1048

When I do
val rd = resultDataframe.rdd
rd only has the data and not the header information. I confirmed this with rd.first, which returns a data row rather than the header. Also, when I try

rd.map(f => f._1+"\t"+f._2+"\t"+f._3).saveAsTextFile("s3://pathinS3/testtab4")

I only see

9 2457583 1047
9 2457576 1048

I would like to convert resultDataframe into a tab-separated CSV file, including the header row, and store it in S3.

Expected CSV output in S3:

storeId dateId projectId
9 2457583 1047
9 2457576 1048

Any help is appreciated. Thanks in advance.

Answer Source

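df.rdd gives you an RDD[Row] that holds only the data rows; the column names are available separately as df.columns. You can build a header line from them and put it in front of the data like this: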

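// df is the DataFrame; sc is the active SparkContext
val rdd = df.rdd
// Join each row's fields with tabs
val data = rdd.map(_.mkString("\t"))
// One-element RDD holding the tab-separated column names
val header = sc.parallelize(Seq(df.columns.mkString("\t")))
// union places the header's partitions ahead of the data partitions
val rddWithHeader = header.union(data)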
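
If you are on Spark 2.0 or later, an alternative worth considering is to skip the RDD step and let the DataFrame writer emit the header itself. The following is only a minimal sketch under some assumptions: it runs against a local SparkSession for illustration, and the object name TsvWithHeader, the helper writeTsv, and the default /tmp/testtab4 path are hypothetical names that are not part of the question. For S3 you would pass your own s3:// path instead.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, introduced only for this illustration.
object TsvWithHeader {

  // Writes df as tab-separated text with a header line using the
  // built-in CSV writer (available from Spark 2.0).
  def writeTsv(df: DataFrame, path: String): Unit =
    df.coalesce(1)                // one partition, so one part file with one header
      .write
      .option("header", "true")   // first line holds the column names
      .option("sep", "\t")        // tab instead of the default comma
      .csv(path)

  def main(args: Array[String]): Unit = {
    // Local session for illustration; on a cluster, reuse the existing one.
    val spark = SparkSession.builder()
      .appName("tsv-with-header")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data mirroring the DataFrame in the question.
    val resultDataframe = Seq((9, 2457583, 1047), (9, 2457576, 1048))
      .toDF("storeId", "dateId", "projectId")

    // Assumed output location; replace with your own path.
    val outputPath = args.headOption.getOrElse("/tmp/testtab4")
    writeTsv(resultDataframe, outputPath)

    spark.stop()
  }
}
```

Two caveats about this sketch: coalesce(1) sends all rows through a single task, so it only suits results that are small enough for that, and without it every part file gets its own header line. With the default save mode the write also fails if the output path already exists.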