I have some JSON data consisting of key-value pairs, with ints as the keys and lists of ints as the values. I want to read this data into a map and then broadcast it so it can be used by another RDD for quick lookups.
I have code that worked on a 1.6.1 Spark cluster in a data center, but the same code won't work on a 2.0.1 Spark cluster in AWS. The 1.6.1 code that works:
sc.broadcast(sqlContext.read.schema(mySchema).json(myPath).map(r => (r.getInt(0), r.getAs[WrappedArray[Int]](1).toArray)).collectAsMap)
In 2.0.1, the same read produces a Dataset rather than an RDD:

val myData = sqlContext.read.schema(mySchema).json(myPath).map(r => (r.getInt(0), r.getSeq[Int](1).toArray))
org.apache.spark.sql.Dataset[(Int, Array[Int])] = [_1: int, _2: array<int>]
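For reference, here is a sketch of how I'd write the whole read-and-broadcast step against the 2.0.x API (assuming the same mySchema and myPath as above; variable names like lookup and lookupBc are my own). It's untested against a live cluster, so treat it as a sketch rather than a definitive implementation:

```scala
import org.apache.spark.sql.SparkSession

// In 2.0.x, SparkSession is the entry point; in spark-shell it already
// exists as `spark`, so builder().getOrCreate() returns the existing one.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Read the JSON, pull the int key (column 0) and the int array (column 1),
// collect to the driver, and build an immutable lookup map.
val lookup: Map[Int, Array[Int]] =
  spark.read.schema(mySchema).json(myPath)
    .map(r => (r.getInt(0), r.getSeq[Int](1).toArray))
    .collect()
    .toMap

// Broadcast the map so executor-side code can do cheap local lookups.
val lookupBc = spark.sparkContext.broadcast(lookup)
```

Executor code then reads it as lookupBc.value(someKey) instead of joining against the JSON data.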
I figured out that my problem was with spark-shell in 2.0.1. The code I posted works fine if I use the existing sc and sqlContext that belong to the Spark session the shell creates. If I call stop and create a new session with a custom config, I get the strange error above. This is inconvenient, because I want to change spark.driver.maxResultSize.
Anyway, the lesson is: if you are testing code using spark shell, use the existing session or it may not work.
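One way to change the setting without stopping the shell's session is to pass it at launch time; a sketch (the 4g value is just an example, pick whatever limit you need):

```shell
# Configure the driver's max result size when starting spark-shell,
# so the shell's built-in session already has the desired config.
spark-shell --conf spark.driver.maxResultSize=4g
```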