Ninja Ninja - 3 years ago 85
Scala Question

How to read a JSON file in Spark using Scala?

I want to read a JSON file in the following format:

{
  "titlename": "periodic",
  "atom": [
    {
      "usage": "neutron",
      "dailydata": [
        {
          "utcacquisitiontime": "2017-03-27T22:00:00Z",
          "datatimezone": "+02:00",
          "intervalvalue": 28128,
          "intervaltime": 15
        },
        {
          "utcacquisitiontime": "2017-03-27T22:15:00Z",
          "datatimezone": "+02:00",
          "intervalvalue": 25687,
          "intervaltime": 15
        }
      ]
    }
  ]
}


I am writing my read line as:

sqlContext.read.json("user/files_fold/testing-data.json").printSchema


But I am not getting the desired result:

root
|-- _corrupt_record: string (nullable = true)


Please help me with this.

Answer Source

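By default, Spark's JSON reader expects each line of the file to hold one complete JSON object. Your file spreads a single object across many lines, so no line parses on its own and Spark returns only _corrupt_record. You can apply a few additional steps to turn the text into single-line JSON before parsing it.

I used wholeTextFiles to read the whole file as one string, then stripped the newlines so that the entire document sits on a single line.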

val json = sc.wholeTextFiles("/user/files_fold/testing-data.json").map(tuple => tuple._2.replace("\n", "").trim)

val df = sqlContext.read.json(json)
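
As an alternative sketch, and assuming Spark 2.2 or later (the question does not state a version), the JSON reader accepts a multiLine option that parses a document spanning several lines without the manual newline stripping. The object name MultiLineJson, the helper read, and the temporary sample file below are hypothetical and only serve to make the example self-contained.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of the original answer.
object MultiLineJson {

  // Reads a JSON document that spans several lines.
  // Assumption: Spark 2.2+, where the "multiLine" reader option is available.
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read.option("multiLine", true).json(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multiline-json-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()

    // A shortened copy of the document from the question, written to a temp file.
    val sample =
      """{
        |  "titlename": "periodic",
        |  "atom": [
        |    {
        |      "usage": "neutron",
        |      "dailydata": [
        |        {
        |          "utcacquisitiontime": "2017-03-27T22:00:00Z",
        |          "datatimezone": "+02:00",
        |          "intervalvalue": 28128,
        |          "intervaltime": 15
        |        }
        |      ]
        |    }
        |  ]
        |}""".stripMargin
    val file = Files.createTempFile("testing-data", ".json")
    Files.write(file, sample.getBytes(StandardCharsets.UTF_8))

    val df = read(spark, file.toUri.toString)
    df.printSchema()

    spark.stop()
  }
}
```

On older Spark versions that lack this option, the wholeTextFiles approach above remains the way to go.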

The resulting DataFrame should look like this:

+--------------------------------------------------------------------------------------------------------+---------+
|atom                                                                                                    |titlename|
+--------------------------------------------------------------------------------------------------------+---------+
|[[WrappedArray([+02:00,15,28128,2017-03-27T22:00:00Z], [+02:00,15,25687,2017-03-27T22:15:00Z]),neutron]]|periodic |
+--------------------------------------------------------------------------------------------------------+---------+

And its schema:

root
 |-- atom: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- dailydata: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- datatimezone: string (nullable = true)
 |    |    |    |    |-- intervaltime: long (nullable = true)
 |    |    |    |    |-- intervalvalue: long (nullable = true)
 |    |    |    |    |-- utcacquisitiontime: string (nullable = true)
 |    |    |-- usage: string (nullable = true)
 |-- titlename: string (nullable = true)