subho subho - 1 month ago 16
Scala Question

sample spark CSV and JSON program not running in windows

I am running spark program in Windows 10 machine.

I am trying to run the below spark program

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark.sql.SQLImplicits
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import com.databricks.spark.csv

object json1 {

def main(args : Array[String]){
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]").set("spark.executor.memory", "1g")
val sc = new org.apache.spark.SparkContext(conf)
val sqlc = new org.apache.spark.sql.SQLContext(sc)

val NyseDF = sqlc.load("com.databricks.spark.csv",Map("path" -> args(0),"header"->"true"))
NyseDF.registerTempTable("NYSE")
NyseDF.printSchema()

}



}


When i run the program through Run application mode in eclispse with passing arguments

as

src/test/resources/demo.text

It fails with below error.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/10/10 11:02:18 INFO SparkContext: Running Spark version 1.6.0
16/10/10 11:02:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/10 11:02:18 INFO SecurityManager: Changing view acls to: subho
16/10/10 11:02:18 INFO SecurityManager: Changing modify acls to: subho
16/10/10 11:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subho); users with modify permissions: Set(subho)
16/10/10 11:02:19 INFO Utils: Successfully started service 'sparkDriver' on port 61108.
16/10/10 11:02:20 INFO Slf4jLogger: Slf4jLogger started
16/10/10 11:02:20 INFO Remoting: Starting remoting
16/10/10 11:02:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.116:61121]
16/10/10 11:02:20 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 61121.
16/10/10 11:02:20 INFO SparkEnv: Registering MapOutputTracker
16/10/10 11:02:20 INFO SparkEnv: Registering BlockManagerMaster
16/10/10 11:02:21 INFO DiskBlockManager: Created local directory at C:\Users\subho\AppData\Local\Temp\blockmgr-69afda02-ccd1-41d1-aa25-830ba366a75c
16/10/10 11:02:21 INFO MemoryStore: MemoryStore started with capacity 1128.4 MB
16/10/10 11:02:21 INFO SparkEnv: Registering OutputCommitCoordinator
16/10/10 11:02:21 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/10/10 11:02:21 INFO SparkUI: Started SparkUI at http://192.168.1.116:4040
16/10/10 11:02:21 INFO Executor: Starting executor ID driver on host localhost
16/10/10 11:02:21 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61132.
16/10/10 11:02:21 INFO NettyBlockTransferService: Server created on 61132
16/10/10 11:02:21 INFO BlockManagerMaster: Trying to register BlockManager
16/10/10 11:02:21 INFO BlockManagerMasterEndpoint: Registering block manager localhost:61132 with 1128.4 MB RAM, BlockManagerId(driver, localhost, 61132)
16/10/10 11:02:21 INFO BlockManagerMaster: Registered BlockManager
16/10/10 11:02:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 107.7 KB, free 107.7 KB)
16/10/10 11:02:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.8 KB, free 117.5 KB)
16/10/10 11:02:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:61132 (size: 9.8 KB, free: 1128.4 MB)
16/10/10 11:02:23 INFO SparkContext: Created broadcast 0 from textFile at TextFile.scala:30
16/10/10 11:02:23 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:146)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1293)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.take(RDD.scala:1288)
at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:174)
at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:169)
at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:147)
at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:70)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:138)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
at json1$.main(json1.scala:22)
at json1.main(json1.scala)
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/demo.text
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1293)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.take(RDD.scala:1288)
at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:174)
at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:169)
at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:147)
at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:70)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:138)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
at json1$.main(json1.scala:22)
at json1.main(json1.scala)
16/10/10 11:02:23 INFO SparkContext: Invoking stop() from shutdown hook
16/10/10 11:02:23 INFO SparkUI: Stopped Spark web UI at http://192.168.1.116:4040
16/10/10 11:02:23 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/10 11:02:23 INFO MemoryStore: MemoryStore cleared
16/10/10 11:02:23 INFO BlockManager: BlockManager stopped
16/10/10 11:02:24 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/10 11:02:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/10 11:02:24 INFO SparkContext: Successfully stopped SparkContext
16/10/10 11:02:24 INFO ShutdownHookManager: Shutdown hook called
16/10/10 11:02:24 INFO ShutdownHookManager: Deleting directory C:\Users\subho\AppData\Local\Temp\spark-7f53ea20-a38c-46d5-8476-a1ae040736ac


Below is the main error msg

Input path does not exist: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/demo.text

I have the file in the below location.

!]1

When i ran the below program it ran sucussfully,

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark.sql.SQLImplicits
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import com.databricks.spark.csv

object json1 {

def main(args : Array[String]){
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]").set("spark.executor.memory", "1g")
val sc = new org.apache.spark.SparkContext(conf)
val sqlc = new org.apache.spark.sql.SQLContext(sc)

/* val NyseDF = sqlc.load("com.databricks.spark.csv",Map("path" -> args(0),"header"->"true"))
NyseDF.registerTempTable("NYSE")
NyseDF.printSchema()

print(sqlc.sql("select distinct(symbol) from NYSE").collect().toList)*/

val PersonDF = sqlc.jsonFile("src/test/resources/Person.json")

// PersonDF.printSchema()

PersonDF.registerTempTable("Person")

sqlc.sql("select * from Person where age < 60").collect().foreach(print)

}


Below is the log file.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/10/10 11:54:12 INFO SparkContext: Running Spark version 1.6.0
16/10/10 11:54:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/10 11:54:13 INFO SecurityManager: Changing view acls to: subho
16/10/10 11:54:13 INFO SecurityManager: Changing modify acls to: subho
16/10/10 11:54:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subho); users with modify permissions: Set(subho)
16/10/10 11:54:14 INFO Utils: Successfully started service 'sparkDriver' on port 51113.
16/10/10 11:54:14 INFO Slf4jLogger: Slf4jLogger started
16/10/10 11:54:14 INFO Remoting: Starting remoting
16/10/10 11:54:15 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.116:51126]
16/10/10 11:54:15 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 51126.
16/10/10 11:54:15 INFO SparkEnv: Registering MapOutputTracker
16/10/10 11:54:15 INFO SparkEnv: Registering BlockManagerMaster
16/10/10 11:54:15 INFO DiskBlockManager: Created local directory at C:\Users\subho\AppData\Local\Temp\blockmgr-a52a5d5a-075b-4859-8434-935fdaba8538
16/10/10 11:54:15 INFO MemoryStore: MemoryStore started with capacity 1128.4 MB
16/10/10 11:54:15 INFO SparkEnv: Registering OutputCommitCoordinator
16/10/10 11:54:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/10/10 11:54:15 INFO SparkUI: Started SparkUI at http://192.168.1.116:4040
16/10/10 11:54:15 INFO Executor: Starting executor ID driver on host localhost
16/10/10 11:54:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51137.
16/10/10 11:54:15 INFO NettyBlockTransferService: Server created on 51137
16/10/10 11:54:15 INFO BlockManagerMaster: Trying to register BlockManager
16/10/10 11:54:15 INFO BlockManagerMasterEndpoint: Registering block manager localhost:51137 with 1128.4 MB RAM, BlockManagerId(driver, localhost, 51137)
16/10/10 11:54:15 INFO BlockManagerMaster: Registered BlockManager
16/10/10 11:54:17 INFO JSONRelation: Listing file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json on driver
16/10/10 11:54:17 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:447)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:98)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:115)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:109)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:108)
at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
at org.apache.spark.sql.SQLContext.jsonFile(SQLContext.scala:1011)
at json1$.main(json1.scala:28)
at json1.main(json1.scala)
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 128.0 KB, free 128.0 KB)
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.1 KB, free 142.1 KB)
16/10/10 11:54:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:51137 (size: 14.1 KB, free: 1128.4 MB)
16/10/10 11:54:18 INFO SparkContext: Created broadcast 0 from jsonFile at json1.scala:28
16/10/10 11:54:18 INFO FileInputFormat: Total input paths to process : 1
16/10/10 11:54:18 INFO SparkContext: Starting job: jsonFile at json1.scala:28
16/10/10 11:54:18 INFO DAGScheduler: Got job 0 (jsonFile at json1.scala:28) with 2 output partitions
16/10/10 11:54:18 INFO DAGScheduler: Final stage: ResultStage 0 (jsonFile at json1.scala:28)
16/10/10 11:54:18 INFO DAGScheduler: Parents of final stage: List()
16/10/10 11:54:18 INFO DAGScheduler: Missing parents: List()
16/10/10 11:54:18 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at jsonFile at json1.scala:28), which has no missing parents
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KB, free 146.3 KB)
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 148.6 KB)
16/10/10 11:54:18 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:51137 (size: 2.4 KB, free: 1128.4 MB)
16/10/10 11:54:18 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/10/10 11:54:18 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at jsonFile at json1.scala:28)
16/10/10 11:54:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/10/10 11:54:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:18 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:18 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/10/10 11:54:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/10/10 11:54:18 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:0+92
16/10/10 11:54:18 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:92+93
16/10/10 11:54:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/10/10 11:54:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/10/10 11:54:18 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/10/10 11:54:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/10/10 11:54:18 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/10/10 11:54:18 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/10/10 11:54:19 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2886 bytes result sent to driver
16/10/10 11:54:19 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2886 bytes result sent to driver
16/10/10 11:54:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1287 ms on localhost (1/2)
16/10/10 11:54:19 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1264 ms on localhost (2/2)
16/10/10 11:54:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/10/10 11:54:19 INFO DAGScheduler: ResultStage 0 (jsonFile at json1.scala:28) finished in 1.314 s
16/10/10 11:54:19 INFO DAGScheduler: Job 0 finished: jsonFile at json1.scala:28, took 1.413653 s
16/10/10 11:54:20 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:51137 in memory (size: 2.4 KB, free: 1128.4 MB)
16/10/10 11:54:20 INFO ContextCleaner: Cleaned accumulator 1
16/10/10 11:54:20 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:51137 in memory (size: 14.1 KB, free: 1128.4 MB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 59.6 KB, free 59.6 KB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 13.8 KB, free 73.3 KB)
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:51137 (size: 13.8 KB, free: 1128.4 MB)
16/10/10 11:54:21 INFO SparkContext: Created broadcast 2 from collect at json1.scala:34
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 128.0 KB, free 201.3 KB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 14.1 KB, free 215.4 KB)
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:51137 (size: 14.1 KB, free: 1128.3 MB)
16/10/10 11:54:21 INFO SparkContext: Created broadcast 3 from collect at json1.scala:34
16/10/10 11:54:21 INFO FileInputFormat: Total input paths to process : 1
16/10/10 11:54:21 INFO SparkContext: Starting job: collect at json1.scala:34
16/10/10 11:54:21 INFO DAGScheduler: Got job 1 (collect at json1.scala:34) with 2 output partitions
16/10/10 11:54:21 INFO DAGScheduler: Final stage: ResultStage 1 (collect at json1.scala:34)
16/10/10 11:54:21 INFO DAGScheduler: Parents of final stage: List()
16/10/10 11:54:21 INFO DAGScheduler: Missing parents: List()
16/10/10 11:54:21 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[9] at collect at json1.scala:34), which has no missing parents
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 7.6 KB, free 223.0 KB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.1 KB, free 227.1 KB)
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:51137 (size: 4.1 KB, free: 1128.3 MB)
16/10/10 11:54:21 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
16/10/10 11:54:21 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[9] at collect at json1.scala:34)
16/10/10 11:54:21 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/10/10 11:54:21 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:21 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:21 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
16/10/10 11:54:21 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
16/10/10 11:54:21 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:92+93
16/10/10 11:54:21 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:0+92
16/10/10 11:54:22 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:51137 in memory (size: 13.8 KB, free: 1128.4 MB)
16/10/10 11:54:22 INFO GenerateUnsafeProjection: Code generated in 548.352258 ms
16/10/10 11:54:22 INFO GeneratePredicate: Code generated in 5.245214 ms
16/10/10 11:54:22 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 2283 bytes result sent to driver
16/10/10 11:54:22 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 2536 bytes result sent to driver
16/10/10 11:54:22 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 755 ms on localhost (1/2)
16/10/10 11:54:22 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 759 ms on localhost (2/2)
16/10/10 11:54:22 INFO DAGScheduler: ResultStage 1 (collect at json1.scala:34) finished in 0.760 s
16/10/10 11:54:22 INFO DAGScheduler: Job 1 finished: collect at json1.scala:34, took 0.779652 s
16/10/10 11:54:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
[53,Barack,Obama]16/10/10 11:54:22 INFO SparkContext: Invoking stop() from shutdown hook
16/10/10 11:54:22 INFO SparkUI: Stopped Spark web UI at http://192.168.1.116:4040
16/10/10 11:54:22 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/10 11:54:22 INFO MemoryStore: MemoryStore cleared
16/10/10 11:54:22 INFO BlockManager: BlockManager stopped
16/10/10 11:54:22 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/10 11:54:22 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/10 11:54:22 INFO SparkContext: Successfully stopped SparkContext
16/10/10 11:54:22 INFO ShutdownHookManager: Shutdown hook called
16/10/10 11:54:22 INFO ShutdownHookManager: Deleting directory C:\Users\subho\AppData\Local\Temp\spark-6cab6329-83f1-4af4-b64c-c869550405a4
16/10/10 11:54:22 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.


Thanks and Regards,

Answer

The important section of the stacktrace is here:

ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

One possibility is to download winutils.exe (e.g. from here), put in in a folder called bin (as a subdirectory or your home directory, e.g. C:\Users\XXX\bin\winutils.exe) and then add this line at the beginning of your code:

System.setProperty("hadoop.home.dir", raw"C:\Users\XXX\")
Comments