SakshamB - 1 month ago
Java Question

Spark example word count execution failed for Java

I was trying to run the Spark word count example at https://spark.apache.org/examples.html, but the execution fails with a null pointer exception. I am working in a standalone environment using files on my local machine. My console output looks like this:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/07/08 18:55:52 INFO SecurityManager: Changing view acls to: saksham_batra
15/07/08 18:55:52 INFO SecurityManager: Changing modify acls to: saksham_batra
15/07/08 18:55:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(saksham_batra); users with modify permissions: Set(saksham_batra)
15/07/08 18:55:52 INFO Slf4jLogger: Slf4jLogger started
15/07/08 18:55:53 INFO Remoting: Starting remoting
15/07/08 18:55:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@BLRKEC350859D.ad.infosys.com:51119]
15/07/08 18:55:53 INFO Utils: Successfully started service 'sparkDriver' on port 51119.
15/07/08 18:55:53 INFO SparkEnv: Registering MapOutputTracker
15/07/08 18:55:53 INFO SparkEnv: Registering BlockManagerMaster
15/07/08 18:55:53 INFO DiskBlockManager: Created local directory at C:\Users\saksham_batra\AppData\Local\Temp\spark-local-20150708185553-431a
15/07/08 18:55:53 INFO MemoryStore: MemoryStore started with capacity 483.0 MB
15/07/08 18:55:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/07/08 18:55:53 INFO HttpFileServer: HTTP File server directory is C:\Users\saksham_batra\AppData\Local\Temp\spark-5f64f0d1-93cd-49fb-80ab-8a1c03dcb5e2
15/07/08 18:55:53 INFO HttpServer: Starting HTTP Server
15/07/08 18:55:53 INFO Utils: Successfully started service 'HTTP file server' on port 51120.
15/07/08 18:55:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/07/08 18:55:53 INFO SparkUI: Started SparkUI at http://BLRKEC350859D.ad.infosys.com:4040
15/07/08 18:55:53 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@BLRKEC350859D.ad.infosys.com:51119/user/HeartbeatReceiver
15/07/08 18:55:54 INFO NettyBlockTransferService: Server created on 51131
15/07/08 18:55:54 INFO BlockManagerMaster: Trying to register BlockManager
15/07/08 18:55:54 INFO BlockManagerMasterActor: Registering block manager localhost:51131 with 483.0 MB RAM, BlockManagerId(<driver>, localhost, 51131)
15/07/08 18:55:54 INFO BlockManagerMaster: Registered BlockManager
15/07/08 18:55:54 INFO MemoryStore: ensureFreeSpace(133168) called with curMem=0, maxMem=506493665
15/07/08 18:55:54 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 130.0 KB, free 482.9 MB)
15/07/08 18:55:54 INFO MemoryStore: ensureFreeSpace(18512) called with curMem=133168, maxMem=506493665
15/07/08 18:55:54 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 18.1 KB, free 482.9 MB)
15/07/08 18:55:54 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:51131 (size: 18.1 KB, free: 483.0 MB)
15/07/08 18:55:54 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/07/08 18:55:54 INFO SparkContext: Created broadcast 0 from textFile at SparkWordCount.java:22
15/07/08 18:55:54 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
at org.apache.spark.SparkContext$$anonfun$26.apply(SparkContext.scala:696)
at org.apache.spark.SparkContext$$anonfun$26.apply(SparkContext.scala:696)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:170)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:170)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:170)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
at org.apache.spark.api.java.JavaPairRDD.reduceByKey(JavaPairRDD.scala:507)
at spark.spark1.SparkWordCount.main(SparkWordCount.java:44)
15/07/08 18:55:54 INFO FileInputFormat: Total input paths to process : 1
15/07/08 18:55:54 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/07/08 18:55:54 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/07/08 18:55:54 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/07/08 18:55:54 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/07/08 18:55:54 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/07/08 18:55:55 INFO SparkContext: Starting job: saveAsTextFile at SparkWordCount.java:47
15/07/08 18:55:55 INFO DAGScheduler: Registering RDD 3 (mapToPair at SparkWordCount.java:41)
15/07/08 18:55:55 INFO DAGScheduler: Got job 0 (saveAsTextFile at SparkWordCount.java:47) with 1 output partitions (allowLocal=false)
15/07/08 18:55:55 INFO DAGScheduler: Final stage: Stage 1(saveAsTextFile at SparkWordCount.java:47)
15/07/08 18:55:55 INFO DAGScheduler: Parents of final stage: List(Stage 0)
15/07/08 18:55:55 INFO DAGScheduler: Missing parents: List(Stage 0)
15/07/08 18:55:55 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[3] at mapToPair at SparkWordCount.java:41), which has no missing parents
15/07/08 18:55:55 INFO MemoryStore: ensureFreeSpace(4264) called with curMem=151680, maxMem=506493665
15/07/08 18:55:55 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KB, free 482.9 MB)
15/07/08 18:55:55 INFO MemoryStore: ensureFreeSpace(3025) called with curMem=155944, maxMem=506493665
15/07/08 18:55:55 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.0 KB, free 482.9 MB)
15/07/08 18:55:55 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:51131 (size: 3.0 KB, free: 483.0 MB)
15/07/08 18:55:55 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/07/08 18:55:55 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/07/08 18:55:55 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[3] at mapToPair at SparkWordCount.java:41)
15/07/08 18:55:55 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/07/08 18:55:55 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1318 bytes)
15/07/08 18:55:55 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/07/08 18:55:55 INFO CacheManager: Partition rdd_1_0 not found, computing it
15/07/08 18:55:55 INFO HadoopRDD: Input split: file:/C:/Users/saksham_batra/Desktop/sample/New Text Document.txt:0+658
15/07/08 18:55:55 INFO MemoryStore: ensureFreeSpace(2448) called with curMem=158969, maxMem=506493665
15/07/08 18:55:55 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 2.4 KB, free 482.9 MB)
15/07/08 18:55:55 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:51131 (size: 2.4 KB, free: 483.0 MB)
15/07/08 18:55:55 INFO BlockManagerMaster: Updated info of block rdd_1_0
15/07/08 18:55:55 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2464 bytes result sent to driver
15/07/08 18:55:55 INFO DAGScheduler: Stage 0 (mapToPair at SparkWordCount.java:41) finished in 0.262 s
15/07/08 18:55:55 INFO DAGScheduler: looking for newly runnable stages
15/07/08 18:55:55 INFO DAGScheduler: running: Set()
15/07/08 18:55:55 INFO DAGScheduler: waiting: Set(Stage 1)
15/07/08 18:55:55 INFO DAGScheduler: failed: Set()
15/07/08 18:55:55 INFO DAGScheduler: Missing parents for Stage 1: List()
15/07/08 18:55:55 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[5] at saveAsTextFile at SparkWordCount.java:47), which is now runnable
15/07/08 18:55:55 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 269 ms on localhost (1/1)
15/07/08 18:55:55 INFO MemoryStore: ensureFreeSpace(95184) called with curMem=161417, maxMem=506493665
15/07/08 18:55:55 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 93.0 KB, free 482.8 MB)
15/07/08 18:55:55 INFO MemoryStore: ensureFreeSpace(56987) called with curMem=256601, maxMem=506493665
15/07/08 18:55:55 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 55.7 KB, free 482.7 MB)
15/07/08 18:55:55 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:51131 (size: 55.7 KB, free: 483.0 MB)
15/07/08 18:55:55 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/07/08 18:55:55 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/07/08 18:55:55 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[5] at saveAsTextFile at SparkWordCount.java:47)
15/07/08 18:55:55 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/07/08 18:55:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/07/08 18:55:55 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1056 bytes)
15/07/08 18:55:55 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/07/08 18:55:55 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/07/08 18:55:55 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/07/08 18:55:55 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/07/08 18:55:55 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/07/08 18:55:55 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/07/08 18:55:55 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
15/07/08 18:55:55 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
15/07/08 18:55:55 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

15/07/08 18:55:55 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
15/07/08 18:55:55 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/08 18:55:55 INFO TaskSchedulerImpl: Cancelling stage 1
15/07/08 18:55:55 INFO DAGScheduler: Job 0 failed: saveAsTextFile at SparkWordCount.java:47, took 0.651288 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


I am new to Spark and cannot figure out where it is going wrong. Please help.

P.S. The first error (java.io.IOException) appears in other examples as well, but it does not hamper their execution in any way.

Answer

Try setting a system environment variable HADOOP_HOME=[SPARKPATH] and placing winutils.exe (available from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe) into your Spark bin folder.

This will probably resolve both errors (certainly the first). The second comes from how Hadoop interacts with Windows file permissions when writing output, and I believe winutils resolves that as well. Both are really Hadoop-on-Windows issues rather than Spark bugs.
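If you would rather not touch system environment variables, the same fix can be applied from code. A minimal sketch, assuming winutils.exe has been placed in C:\spark\bin (the path is an example; substitute your own Spark directory). Setting the hadoop.home.dir system property has the same effect as HADOOP_HOME, but it must happen before the SparkContext is created, because org.apache.hadoop.util.Shell looks the path up in a static initializer:

```java
public class WordCountSetup {
    public static void main(String[] args) {
        // Hypothetical install location: Hadoop expects winutils.exe
        // under <hadoop.home.dir>\bin, i.e. C:\spark\bin\winutils.exe here.
        System.setProperty("hadoop.home.dir", "C:\\spark");

        // ...then build the SparkConf / JavaSparkContext and run the
        // word count exactly as in the example code.
    }
}
```

Doing it in code also means you don't need to restart your IDE for a new environment variable to take effect.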