Eugene Cuz - 1 month ago
Scala Question

org.apache.spark.SparkException: Task not serializable. How to run a method in map{}

I am trying to call a method on a value but get an error. My method:

processDate(p(2))


The values look something like 20160125204123

This is my class:

class ScalaJob(sc: SparkContext) {
  def run(filePath: String): RDD[(String, String, String)] = {
    // read the file
    val file = sc.textFile(filePath)
    // extract the values from every row
    val values = file.map { dataRaw =>
      val p = dataRaw.split("[|]", -1)
      (p(1), processDate(p(2)), p(32))
    }
    values
  }


My method should return a string:

def processDate(s: String): String = {
  // ...
}

Is there a way to make it work?

Answer

Any code used inside RDD.map (in this case file.map) is serialized and shipped to the executors, so everything it references must be serializable. Here the closure calls processDate, which is defined on ScalaJob; that pulls the whole ScalaJob instance, including its non-serializable SparkContext, into the closure. Make sure the class in which the method is defined is serializable, and note that the entire dependency chain must be serializable. One quick option is to bind processDate to a val as a function and use that inside the RDD. Alternatively, define the method in an object. Example:

class ScalaJob(sc: SparkContext) {
  def run(filePath: String): RDD[(String, String, String)] = {
    // read the file
    val file = sc.textFile(filePath)
    // bind processDate to a local function value, as described above
    val process = processDate _
    // extract the values from every row
    val values = file.map { dataRaw =>
      val p = dataRaw.split("[|]", -1)
      (p(1), process(p(2)), p(32))
    }
    values
  }
}
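The second option, moving the method into an object, can be sketched like this. The actual body of processDate was not shown in the question, so the date formatting below (turning 20160125204123 into 2016-01-25 20:41:23) is only an assumed placeholder; the point is the structure, not the parsing logic.

```scala
// Hypothetical helper object. Calling a method on a top-level object
// inside an RDD closure does not capture the ScalaJob instance (or its
// non-serializable SparkContext), so the closure serializes cleanly.
object DateUtils extends Serializable {
  // Assumed format: "20160125204123" -> "2016-01-25 20:41:23"
  def processDate(s: String): String = {
    val (date, time) = s.splitAt(8)
    s"${date.take(4)}-${date.slice(4, 6)}-${date.slice(6, 8)} " +
      s"${time.take(2)}:${time.slice(2, 4)}:${time.slice(4, 6)}"
  }
}
```

Inside ScalaJob.run you would then call DateUtils.processDate(p(2)) in the map, with no change to the rest of the pipeline.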

See Spark Task Not Serializable for more details.
