Conan Conan - 3 months ago 15
Scala Question

ClassNotFoundException anonfun when deploy scala code to Spark

I'm new to Apache Spark and I'm trying to deploy a piece of simple scala code to the Spark.

Note: I am trying to connect to a an existing running cluster which I configure via my java parameters to be:

spark.master=spark://MyHostName:7077


Environment


  • Spark 1.5.1 build with scala 2.10

  • Spark runs standalone mode on my local machine

  • OS: Mac OS El Captain

  • JVM: JDK 1.8.0_60

  • IDE: IntelliJ IDEA Community 14.1.5

  • Scala version: 2.10.4



sbt: 0.13.8

Code

import org.apache.spark.{SparkConf, SparkContext}
object HelloSpark {
def main(args: Array[String]) {
val logFile = "/README.md"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a")).count()
println("%s done!".format(numAs))
}
}


build.sbt

name := "data-streamer210"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.10" % "1.5.1",
"org.apache.spark" % "spark-streaming_2.10" % "1.5.1",
"org.apache.spark" % "spark-mllib_2.10" % "1.5.1",
"org.apache.spark" % "spark-bagel_2.10" % "1.5.1",
"org.apache.spark" % "spark-streaming-twitter_2.10" % "1.5.1"
)


Error

15/10/19 19:40:09 INFO SparkContext: Starting job: count at HelloSpark.scala:14
15/10/19 19:40:09 INFO DAGScheduler: Got job 0 (count at HelloSpark.scala:14) with 2 output partitions
15/10/19 19:40:09 INFO DAGScheduler: Final stage: ResultStage 0(count at HelloSpark.scala:14)
15/10/19 19:40:09 INFO DAGScheduler: Parents of final stage: List()
15/10/19 19:40:09 INFO DAGScheduler: Missing parents: List()
15/10/19 19:40:09 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at HelloSpark.scala:14), which has no missing parents
15/10/19 19:40:09 INFO MemoryStore: ensureFreeSpace(3192) called with curMem=120313, maxMem=2061647216
15/10/19 19:40:09 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 1966.0 MB)
15/10/19 19:40:09 INFO MemoryStore: ensureFreeSpace(1892) called with curMem=123505, maxMem=2061647216
15/10/19 19:40:09 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1892.0 B, free 1966.0 MB)
15/10/19 19:40:09 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 127.0.0.1:50941 (size: 1892.0 B, free: 1966.1 MB)
15/10/19 19:40:09 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
15/10/19 19:40:09 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at HelloSpark.scala:14)
15/10/19 19:40:09 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/10/19 19:40:10 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@127.0.0.1:50951/user/Executor#-147774947]) with ID 0
15/10/19 19:40:10 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:10 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:10 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@127.0.0.1:50952/user/Executor#1450479604]) with ID 2
15/10/19 19:40:10 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@127.0.0.1:50957/user/Executor#1447408721]) with ID 1
15/10/19 19:40:10 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@127.0.0.1:50955/user/Executor#1397136754]) with ID 3
15/10/19 19:40:10 INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:50963 with 530.0 MB RAM, BlockManagerId(0, 127.0.0.1, 50963)
15/10/19 19:40:10 INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:50964 with 530.0 MB RAM, BlockManagerId(2, 127.0.0.1, 50964)
15/10/19 19:40:10 INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:50965 with 530.0 MB RAM, BlockManagerId(1, 127.0.0.1, 50965)
15/10/19 19:40:10 INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:50966 with 530.0 MB RAM, BlockManagerId(3, 127.0.0.1, 50966)
15/10/19 19:40:11 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 127.0.0.1:50963 (size: 1892.0 B, free: 530.0 MB)
15/10/19 19:40:11 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 127.0.0.1): java.lang.ClassNotFoundException: HelloSpark$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

15/10/19 19:40:11 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 1]
15/10/19 19:40:11 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 3, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 127.0.0.1:50966 (size: 1892.0 B, free: 530.0 MB)
15/10/19 19:40:11 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 127.0.0.1:50964 (size: 1892.0 B, free: 530.0 MB)
15/10/19 19:40:11 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 2]
15/10/19 19:40:11 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 3]
15/10/19 19:40:11 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 4]
15/10/19 19:40:11 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 5, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 6, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 5) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 5]
15/10/19 19:40:11 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 7, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 6]
15/10/19 19:40:11 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
15/10/19 19:40:11 INFO TaskSchedulerImpl: Cancelling stage 0
15/10/19 19:40:11 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/10/19 19:40:11 INFO DAGScheduler: ResultStage 0 (count at HelloSpark.scala:14) failed in 2.613 s
15/10/19 19:40:11 INFO DAGScheduler: Job 0 failed: count at HelloSpark.scala:14, took 2.716305 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, 127.0.0.1): java.lang.ClassNotFoundException: HelloSpark$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
at org.apache.spark.rdd.RDD.count(RDD.scala:1121)
at HelloSpark$.main(HelloSpark.scala:14)
at HelloSpark.main(HelloSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.ClassNotFoundException: HelloSpark$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/10/19 19:40:11 INFO SparkContext: Invoking stop() from shutdown hook
15/10/19 19:40:11 WARN TaskSetManager: Lost task 1.3 in stage 0.0 (TID 6, 127.0.0.1): org.apache.spark.TaskKilledException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:204)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

15/10/19 19:40:11 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/10/19 19:40:11 INFO SparkUI: Stopped Spark web UI at http://127.0.0.1:4040
15/10/19 19:40:11 INFO DAGScheduler: Stopping DAGScheduler
15/10/19 19:40:11 INFO SparkDeploySchedulerBackend: Shutting down all executors
15/10/19 19:40:11 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
15/10/19 19:40:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/10/19 19:40:11 INFO MemoryStore: MemoryStore cleared
15/10/19 19:40:11 INFO BlockManager: BlockManager stopped
15/10/19 19:40:11 INFO BlockManagerMaster: BlockManagerMaster stopped
15/10/19 19:40:11 INFO SparkContext: Successfully stopped SparkContext
15/10/19 19:40:11 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/10/19 19:40:11 INFO ShutdownHookManager: Shutdown hook called
15/10/19 19:40:11 INFO ShutdownHookManager: Deleting directory /private/var/folders/q9/m_d81ms107n09tj8k5wbzfb40000gp/T/spark-53ce9474-5488-4d50-bfb6-c58ddeed7640

Process finished with exit code 1

Answer

When you run Spark from IntelliJ you can either connect to a "local" spark JVM or to a remote cluster.

If you set you master to be local (e.g., setMaster("local[*]")), then any code you have in your local scope/project will be available to this temporary, local (single JVM) cluster you just created. Everything runs locally and will exit when your tests ends (if you running a unit test), or when you exit the app if you are running it as an app inside IntelliJ.

However, if you set master to point to a remote cluster (say setMaster("spark://localhost:7077")) you need to make sure that your cluster has access to your new code (in your case it needs to have access to the closure you are passing to filter).

When I want to execute a new piece of code on a running Spark cluster, I usually do that by packaging my app in an Uber Jar (see sbt-assembly) and then passing this as an argument in spark-submit (see more details by clicking on the link).

Comments