Question

Spark SQL: Flatten JSON

I have a JSON file which looks like this:

{"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley","year":2012}]}


I want to store the output in a CSV file like this:

Michael,{"sname":"stanford", "year":2010}

Michael,{"sname":"berkeley", "year":2012}


I have tried the following:

import org.apache.spark.sql.functions.explode
import sqlContext.implicits._
val people = sqlContext.read.json("people.json")
val flattened = people.select($"name", explode($"schools").as("schools_flat"))


The above code does not give schools_flat as a JSON string; it comes back as a struct column instead.
Any ideas on how to get the expected output?

Thanks

Answer

You need to specify the schema explicitly to read the JSON file in the desired way. In this case it looks like this:

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types.StructType
import sqlContext.implicits._

// Declaring schools as Array[String] makes Spark keep each array element
// as its raw JSON text instead of parsing it into a struct.
case class json_schema_class(cities: String, name: String, schools: Array[String])
val json_schema = ScalaReflection.schemaFor[json_schema_class].dataType.asInstanceOf[StructType]

val people = sqlContext.read.schema(json_schema).json("people.json")
val flattened = people.select($"name", explode($"schools").as("schools_flat"))

The flattened DataFrame looks like this:

+-------+--------------------+
|   name|        schools_flat|
+-------+--------------------+
|Michael|{"sname":"stanfor...|
|Michael|{"sname":"berkele...|
+-------+--------------------+
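
If you also need the CSV file shown in the question, note that the JSON strings themselves contain commas, so a standard CSV writer would quote the second field. A minimal sketch that simply joins the two string columns with a comma and writes them out as text (the output path people_flat is just a placeholder):

// Both columns are plain strings thanks to the schema trick above,
// so we can concatenate them by hand and save the result as text.
flattened
  .rdd
  .map(row => s"${row.getString(0)},${row.getString(1)}")
  .saveAsTextFile("people_flat")

Note that saveAsTextFile produces a directory of part files; add a coalesce(1) before the save if a single output file is required.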