JimLohse JimLohse - 15 days ago 7
Scala Question

Spark spark-submit --jars arguments wants comma list, how to declare a directory of jars?

In Submitting Applications in the Spark docs, as of 1.6.0 and earlier, it's not clear how to specify the --jars argument, as it's apparently not a colon-separated classpath not a directory expansion.

The docs say "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes."


Question: What are all the options for submitting a classpath with
--jars in the spark-submit script in $SPARK_HOME/bin? Anything undocumented that could be submitted as an improvement for docs?


I ask because when I was testing --jars today, we had to explicitly provide a path to each jar:

/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData ---jars=local:/usr/local/spark/jars/groovy-all-2.3.3.jar,local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar,local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar,local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar /usr/local/spark/jars/thold-0.0.1-1.jar


We are choosing to pre-populate the cluster with all the jars in /usr/local/spark/jars on each worker, it seemed that if no local:/ file:/ or hdfs: was supplied, then the default is file:/ and the driver makes the jars available on a webserver run by the driver. I chose local, as above.

And it seems that we do not need to put the main jar in the --jars argument, I have not tested yet if other classes in the final argument (application-jar arg per docs, i.e. /usr/local/spark/jars/thold-0.0.1-1.jar) are shipped to workers, or if I need to put the application-jar in the --jars path to get classes not named after --class to be seen.

(And granted with Spark standalone mode using --deploy-mode client, you also have to put a copy of the driver on each worker but you don't know up front which worker will run the driver)

Answer

In this way it worked easily.. instead of specifying each jar with version separately..

#!/bin/sh
# build all other dependent jars in OTHER_JARS

JARS=`find ../lib -name '*.jar'`
OTHER_JARS=""
   for eachjarinlib in $JARS ; do    
if [ "$eachjarinlib" != "APPLICATIONJARTOBEADDEDSEPERATELY.JAR" ]; then
       OTHER_JARS=$eachjarinlib,$OTHER_JARS
fi
done
echo ---final list of jars are : $OTHER_JARS
echo $CLASSPATH

spark-submit --verbose --class <yourclass>
... OTHER OPTIONS
--jars $OTHER_JARS,APPLICATIONJARTOBEADDEDSEPERATELY.JAR