smeeb - 2 months ago

Question

How to read in-memory JSON string into Spark DataFrame

I'm trying to read an in-memory JSON string into a Spark DataFrame on the fly:

val someJSON : String = getJSONSomehow()
val someDF : DataFrame = magic.convert(someJSON)


I've spent quite a bit of time looking at the Spark API, and the best I can find is to use a sqlContext like so:

val someJSON : String = getJSONSomehow()
val tmpPath = s"/tmp/json/${UUID.randomUUID().toString}"
val tmpFile : Output = Resource.fromFile(tmpPath)
tmpFile.write(someJSON)(Codec.UTF8)
val someDF : DataFrame = sqlContext.read.json(tmpPath)


But this feels kind of awkward/wonky and imposes the following constraints:


  1. It requires me to format my JSON as one object per line (per the documentation);

  2. It forces me to write the JSON to a temp file, which is slow and awkward; and

  3. It forces me to clean up temp files over time, which is cumbersome and feels "wrong" to me.



So I ask: Is there a direct and more efficient way to convert a JSON string into a Spark DataFrame?

Answer

From the Spark SQL guide:

// Wrap the in-memory JSON string in an RDD[String]; Spark infers the schema
val otherPeopleRDD = spark.sparkContext.makeRDD(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

This creates a DataFrame directly from an intermediate RDD[String] built from the in-memory JSON string, so no temp file (and no cleanup) is needed, and Spark infers the schema for you.
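As a side note, on Spark 2.2 and later the RDD[String] overload of json is deprecated in favor of json(Dataset[String]), so the same conversion can be written without an explicit RDD. A minimal self-contained sketch, assuming a local SparkSession (the session setup and the example JSON string are illustrative, not from the original question):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonFromString {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("json-from-string")
      .getOrCreate()
    import spark.implicits._

    val someJSON = """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}"""

    // Since Spark 2.2, DataFrameReader.json accepts a Dataset[String],
    // so the in-memory string never touches disk or an explicit RDD.
    val someDF: DataFrame = spark.read.json(Seq(someJSON).toDS())
    someDF.show()

    spark.stop()
  }
}
```

With either variant, the one-object-per-line constraint only applies when reading JSON from files; a string handed to Spark this way is parsed as a standalone JSON record.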
