Revathy Murugesan - 15 days ago

Question

Spark SQL: convert DataFrame to JSON

My requirement is to pass a DataFrame as an input parameter to a Scala class that saves the data in JSON format to HDFS.

The input parameter looks like this:

case class ReportA(
  parm1: String,
  parm2: String,
  parm3: Double,
  parm4: Double,
  parm5: DataFrame
)


I have created a JSON object for this parameter like this:

def write(xx: ReportA) = JsObject(
  "field1" -> JsString(xx.parm1),
  "field2" -> JsString(xx.parm2),
  "field3" -> JsNumber(xx.parm3),
  "field4" -> JsNumber(xx.parm4),
  "field5" -> JsArray(xx.parm5)
)


parm5 is a DataFrame, and I want to convert it to a JSON array.

How can I convert the DataFrame to a JSON array?

Thank you for your help!!!

Answer

A DataFrame can be seen as the equivalent of a plain old table in a database, with rows and columns. You can't just get a simple array from it; the closest you would come to an array is with the following structure:

{
    "col1": [val1, val2, ..],
    "col2": [val3, val4, ..],
    "col3": [val5, val6, ..]
}

To achieve a similar structure, you could use the toJSON method of the DataFrame API to get an RDD[String] (one JSON document per row) and then call collect on it (be careful of OutOfMemoryError, since this pulls the whole DataFrame onto the driver).
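
For illustration, a minimal sketch of that step (dataFrameToJsonStrings is a hypothetical helper; df stands for the DataFrame, e.g. xx.parm5):

    import org.apache.spark.sql.DataFrame

    // toJSON serializes each row to a JSON string; collect() then pulls
    // all of those strings onto the driver, so this is only safe when
    // the DataFrame fits in driver memory.
    def dataFrameToJsonStrings(df: DataFrame): Array[String] =
      df.toJSON.collect()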

You now have an Array[String], which you can transform into a JSON array with whatever JSON library you are using.
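
For example, with spray-json (an assumption, based on the JsObject/JsString calls in the question), you could parse each row string back into a JsValue and wrap them all in a JsArray; dataFrameToJsArray is a hypothetical helper:

    import spray.json._
    import org.apache.spark.sql.DataFrame

    // Parse each per-row JSON string and wrap the results in a single
    // JsArray. Like any collect-based approach, small DataFrames only.
    def dataFrameToJsArray(df: DataFrame): JsArray =
      JsArray(df.toJSON.collect().map(_.parseJson).toVector)

    // field5 in the question's writer would then become:
    //   "field5" -> dataFrameToJsArray(xx.parm5)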

Beware though: this is a rather unusual way to use Spark. You generally don't collect an RDD or a DataFrame and transform it directly into one of your own objects; you usually write it out to a storage system.
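
If the end goal really is JSON on HDFS, the idiomatic route is to let Spark write it directly (the output path below is just a placeholder):

    // Write the DataFrame as newline-delimited JSON, one part-file per
    // partition, without ever collecting the data onto the driver.
    xx.parm5.write.json("hdfs:///path/to/output")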
