My requirement is to pass a DataFrame as an input parameter to a Scala class which saves the data in JSON format to HDFS.
The input parameter looks like this:
import spray.json._  // spray-json assumed from the JsObject/JsString/JsNumber syntax

// Field list and types inferred from the writer below
case class ReportA(parm1: String, parm2: String, parm3: BigDecimal,
                   parm4: BigDecimal, parm5: Vector[JsValue])

def write(xx: ReportA) = JsObject(
  "field1" -> JsString(xx.parm1),
  "field2" -> JsString(xx.parm2),
  "field3" -> JsNumber(xx.parm3),
  "field4" -> JsNumber(xx.parm4),
  "field5" -> JsArray(xx.parm5)
)
A DataFrame can be seen as the equivalent of a plain old table in a database, with rows and columns. You can't get a simple array straight out of it; the closest you would come to one is a column-wise structure like this:

{ "col1": [val1, val2, ..], "col2": [val3, val4, ..], "col3": [val5, val6, ..] }
To achieve a similar structure, you could use the toJSON method of the DataFrame API to get an RDD[String] and then call collect on it (be careful of any OutOfMemoryError, since collect brings every row back to the driver). You then have an Array[String], which you can simply transform into a JsonArray depending on the JSON library you are using.
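A minimal sketch of that approach, assuming Spark 2.x (where toJSON returns a Dataset[String] rather than an RDD[String]) and spray-json for the parsing step; the session setup and the two-column DataFrame are illustrative stand-ins:

import org.apache.spark.sql.SparkSession
import spray.json._

val spark = SparkSession.builder().appName("df-to-json").getOrCreate()
import spark.implicits._

// Stand-in for your real DataFrame
val df = Seq(("a", 1), ("b", 2)).toDF("name", "value")

// toJSON yields one JSON string per row; collect pulls them all to the driver
val rows: Array[String] = df.toJSON.collect()

// Parse each row string and wrap the lot in a single JsArray
val jsonArray: JsArray = JsArray(rows.map(_.parseJson).toVector)
// e.g. [{"name":"a","value":1},{"name":"b","value":2}]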
Beware though: this seems like a rather bizarre way to use Spark. You generally don't collect an RDD or a DataFrame and transform it directly into one of your own objects; you usually write it out to a storage system instead.
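In that spirit, if JSON files on HDFS are the end goal, the usual route is to let Spark write them out directly instead of collecting anything to the driver; a one-liner sketch, with an illustrative path:

// Spark writes one JSON object per line, one file per partition
df.write.mode("overwrite").json("hdfs:///tmp/reportA")

Each executor writes its own partition in parallel, so this scales without the driver-memory pressure that collect introduces.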