I am looking for a way to export data from Apache Spark to various other tools in JSON format. I presume there must be a really straightforward way to do it.
Example: I load the JSON file 'jfile.json' into a SchemaRDD:

    jsonRDD = sqlContext.jsonFile('jfile.json')

and its rows look like this:

    [Row(key='value_a1', key2='value_b1'), Row(key='value_a2', key2='value_b2')]
I can't see an easy way to do it. One solution is to convert each element of the SchemaRDD to a String, ending up with an RDD[String] where each element is formatted JSON for that row. So, you need to write your own JSON serializer. That's the easy part. It may not be super fast, but it should work in parallel, and you already know how to save an RDD to a text file.
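As a rough sketch, assuming a row_to_json helper like the one outlined below (the helper name and the output path are mine, not part of any API), the pipeline would look like:

    # Sketch: serialize each Row to a JSON string, then save as text.
    # row_to_json is a hypothetical helper, outlined below.
    schema = jsonRDD.schema()  # schema() in Spark 1.x; a .schema property on later DataFrames
    jsonStrings = jsonRDD.map(lambda row: row_to_json(row, schema))
    jsonStrings.saveAsTextFile('jfile_out')  # one JSON document per line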
The key insight is that you can get a representation of the schema out of the SchemaRDD by calling its schema method. Then each Row handed to you by map needs to be traversed recursively in conjunction with that schema. For flat JSON this is an in-tandem list traversal, but you may also need to handle nested JSON recursively.
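Since my working code is in Scala, here is only a rough Python sketch of that traversal; the function names are made up, and like my Scala version it does not yet handle null values (see the caution below):

    import json
    # In Spark 1.x these types live in pyspark.sql; later versions moved them to pyspark.sql.types.
    from pyspark.sql import StructType, ArrayType

    def format_item(value, data_type):
        # Walk a value and its DataType from the schema in tandem.
        if isinstance(data_type, StructType):
            # A struct pairs each schema field with the value at the same position.
            pairs = ['%s: %s' % (json.dumps(f.name), format_item(value[i], f.dataType))
                     for i, f in enumerate(data_type.fields)]
            return '{' + ', '.join(pairs) + '}'
        elif isinstance(data_type, ArrayType):
            # An array applies its element type to every element.
            items = [format_item(v, data_type.elementType) for v in value]
            return '[' + ', '.join(items) + ']'
        else:
            # Atomic values: let the json module handle quoting and escaping.
            return json.dumps(value)

    def row_to_json(row, schema):
        # The top-level Row is just a struct traversed with the full schema.
        return format_item(row, schema)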
The rest is just a small matter of Python, which I don't speak, but I do have this working in Scala in case it helps you. The parts where the Scala code gets dense don't actually depend on deep Spark knowledge, so if you can follow the basic recursion and know Python, you should be able to make it work. The bulk of the work for you is figuring out how to work with a pyspark.sql.Row and a pyspark.sql.StructType in the Python API.
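For orientation, the relevant pieces of the Python API look roughly like this (untested on my side, since I work in Scala):

    # Inspect the schema and a sample Row interactively.
    schema = jsonRDD.schema()              # a pyspark.sql.StructType
    print([(f.name, f.dataType) for f in schema.fields])
    first = jsonRDD.first()                # a pyspark.sql.Row
    print(first[0])                        # fields are accessible by position ...
    print(first.key)                       # ... and by name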
One word of caution: I'm pretty sure my code doesn't yet work for missing values -- the formatItem method needs to handle null elements.
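In the Python sketch above, a guard at the top of format_item would cover that:

    def format_item(value, data_type):
        # Emit a JSON null for missing values before inspecting the type,
        # so null structs, arrays, and atoms are all handled the same way.
        if value is None:
            return 'null'
        # ... rest as above ...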
Edit: In Spark 1.2.0 the toJSON method was introduced on SchemaRDD, making this a much simpler problem -- see the answer by @jegordon.
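With that, the whole thing collapses to something like (the output path is a placeholder):

    # Spark 1.2+: toJSON returns an RDD of JSON strings, one document per row.
    jsonRDD.toJSON().saveAsTextFile('jfile_out')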