raj kumar - 11 months ago
Scala Question

How to read JSON with a schema in Spark DataFrames / Spark SQL

Please help me out, or suggest a good way to read this JSON.



Answer Source

Your JSON does not appear to be valid. Please check it with http://www.jsoneditoronline.org/

Please see an-introduction-to-json-support-in-spark-sql.html

If you want to register the data as a table, first load the valid JSON file into a DataFrame and print its schema (Java):

DataFrame df = sqlContext.read().json("/path/to/validjsonfile");
df.printSchema();

The snippet below selects a top-level field and then a field nested inside it:

DataFrame app = df.select("toplevel");
DataFrame appName = app.select("toplevel.sublevel");

Example with Scala, using a people.json file that holds one JSON object per line:

{"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}
{"name":"Andy", "cities":["santa cruz"], "schools":[{"sname":"ucsb", "year":2011}]}
{"name":"Justin", "cities":["portland"], "schools":[{"sname":"berkeley", "year":2014}]}

val people = sqlContext.read.json("people.json")
people: org.apache.spark.sql.DataFrame
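
The read above lets Spark infer the schema from the data. Since the question asks about reading JSON with a schema, here is a minimal sketch of supplying one explicitly through read.schema(...). It assumes Spark 2.2 or later, where SparkSession takes the place of sqlContext and json() accepts a Dataset[String]. The object name ReadJsonWithSchema, the helper readPeople, and the in-memory sample line are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper object; all names here are illustrative.
object ReadJsonWithSchema {
  // Hand-written schema matching the sample people records.
  val schoolType: StructType = StructType(Seq(
    StructField("sname", StringType),
    StructField("year", LongType)))

  val peopleSchema: StructType = StructType(Seq(
    StructField("name", StringType),
    StructField("cities", ArrayType(StringType)),
    StructField("schools", ArrayType(schoolType))))

  // Parses JSON lines with the explicit schema instead of inferring one.
  def readPeople(spark: SparkSession, jsonLines: Seq[String]): DataFrame = {
    import spark.implicits._ // needed for toDS()
    spark.read.schema(peopleSchema).json(jsonLines.toDS())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.2+ session.
    val spark = SparkSession.builder()
      .master("local[*]").appName("json-schema-sketch").getOrCreate()
    val people = readPeople(spark, Seq(
      """{"name":"Andy", "cities":["santa cruz"], "schools":[{"sname":"ucsb", "year":2011}]}"""))
    people.printSchema()
    spark.stop()
  }
}
```

To read from a file instead, the same reader should accept a path, as in spark.read.schema(ReadJsonWithSchema.peopleSchema).json("people.json"). Supplying the schema skips the inference pass over the data and keeps column types stable across files.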

Read a Top-Level Field

val names = people.select('name).collect()
names: Array[org.apache.spark.sql.Row] = Array([Michael], [Andy], [Justin])

names.map(row => row.getString(0))
res88: Array[String] = Array(Michael, Andy, Justin)

Use the select() method to specify the top-level field, collect() to collect it into an Array[Row], and the getString() method to access a column inside each Row.
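
As a variation on the steps above, the Row handling can be skipped by converting the selected column to a typed Dataset before collecting. This is a sketch assuming Spark 2.x or later with a SparkSession; the object name TopLevelField and the helper names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; names are illustrative.
object TopLevelField {
  // Returns the "name" column as plain strings, with no Row unpacking.
  def names(spark: SparkSession, people: DataFrame): Array[String] = {
    import spark.implicits._ // supplies the Encoder[String] used by as[String]
    people.select("name").as[String].collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.2+ session and in-memory sample lines.
    val spark = SparkSession.builder()
      .master("local[*]").appName("top-level-field-sketch").getOrCreate()
    import spark.implicits._
    val people = spark.read.json(Seq(
      """{"name":"Michael", "cities":["palo alto", "menlo park"]}""",
      """{"name":"Andy", "cities":["santa cruz"]}""").toDS())
    names(spark, people).foreach(println)
    spark.stop()
  }
}
```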

Flatten and Read a JSON Array

Each person has an array of "cities". Let's flatten these arrays and read out all their elements.

val flattened = people.explode("cities", "city"){c: Seq[String] => c}
flattened: org.apache.spark.sql.DataFrame

val allCities = flattened.select('city).collect()
allCities: Array[org.apache.spark.sql.Row]

allCities.map(row => row.getString(0))
res92: Array[String] = Array(palo alto, menlo park, santa cruz, portland)

The explode() method explodes, or flattens, the cities array into a new column named "city". We then use select() to select the new column, collect() to collect it into an Array[Row], and getString() to access the data inside each Row.
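
Note that the explode() method on DataFrame is deprecated as of Spark 2.0 in favor of select() with functions.explode. A minimal sketch of the same flattening with that function follows, assuming Spark 2.2 or later; the object name FlattenCities and the helper allCities are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper object; names are illustrative.
object FlattenCities {
  // Emits one row per array element in a column named "city",
  // then collects the values as plain strings.
  def allCities(people: DataFrame): Array[String] =
    people
      .select(explode(col("cities")).as("city"))
      .collect()
      .map(row => row.getString(0))

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.2+ session and in-memory sample lines.
    val spark = SparkSession.builder()
      .master("local[*]").appName("flatten-cities-sketch").getOrCreate()
    import spark.implicits._
    val people = spark.read.json(Seq(
      """{"name":"Michael", "cities":["palo alto", "menlo park"]}""",
      """{"name":"Andy", "cities":["santa cruz"]}""",
      """{"name":"Justin", "cities":["portland"]}""").toDS())
    allCities(people).foreach(println)
    spark.stop()
  }
}
```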

Read an Array of Nested JSON Objects, Unflattened

Next, read out the "schools" data, which is an array of nested JSON objects. Each element of the array holds the school name and year:

val schools = people.select('schools).collect()
schools: Array[org.apache.spark.sql.Row]

val schoolsArr = schools.map(row => row.getSeq[org.apache.spark.sql.Row](0))
schoolsArr: Array[Seq[org.apache.spark.sql.Row]]

schoolsArr.foreach(schools => {
    // print one (name, year) pair per school
    schools.foreach(row => println((row.getString(0), row.getLong(1))))
})

Use select() and collect() to select the "schools" array and collect it into an Array[Row]. Each "schools" value is itself a sequence of Row objects, so we read it out with the getSeq[Row]() method. Finally, we read the information for each individual school by calling getString() for the school name and getLong() for the school year.
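
If the nested objects are easier to work with as flat rows, the "schools" array can also be exploded and its fields selected by name. This is a sketch assuming Spark 2.2 or later and an inferred schema in which "year" is a long; the object name FlattenSchools and the helper schoolPairs are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper object; names are illustrative.
object FlattenSchools {
  // One output row per school; nested fields are addressed as "school.<field>".
  def schoolPairs(people: DataFrame): Array[(String, Long)] =
    people
      .select(explode(col("schools")).as("school"))
      .select(col("school.sname"), col("school.year"))
      .collect()
      .map(row => (row.getString(0), row.getLong(1)))

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.2+ session and in-memory sample lines.
    val spark = SparkSession.builder()
      .master("local[*]").appName("flatten-schools-sketch").getOrCreate()
    import spark.implicits._
    val people = spark.read.json(Seq(
      """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}""",
      """{"name":"Andy", "schools":[{"sname":"ucsb", "year":2011}]}""").toDS())
    schoolPairs(people).foreach(println)
    spark.stop()
  }
}
```

Selecting by field name rather than by position avoids depending on the order in which Spark arranges the inferred struct fields.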