The steps are:
1. Package all the Python files into pyspark.zip when building Spark.
2. spark-submit to YARN distributes pyspark.zip to all the machines.
3. The Spark worker finds pyspark.zip and runs the Python files in it.
But the code here and here shows that it only puts the zip file's path into the ProcessBuilder's environment, and I haven't found any code that unzips pyspark.zip.
So I'm wondering: how does ProcessBuilder unzip pyspark.zip?
Or how does the Spark worker run the Python files inside pyspark.zip?
In fact, if you run python -h, it shows:

Other environment variables: PYTHONPATH: ':'-separated list of directories prefixed to the default module search path. The result is sys.path.
So the process started by ProcessBuilder can use the zip without unzipping it. A zip file can be imported in Python directly; you don't need to unzip it.
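To see this for yourself, here is a small self-contained sketch: it builds a zip containing one module (the module name "greet" and the message are just placeholders for illustration), then imports it straight from the archive, both in-process by putting the zip on sys.path and in a child process by putting it on PYTHONPATH, which is exactly what Spark does for the worker.

```python
import os
import subprocess
import sys
import tempfile
import zipfile

# Build a zip containing a single module.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "test.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("greet.py", "def hello():\n    return 'hello from zip'\n")

# In-process: adding the zip to sys.path is equivalent to putting it on
# PYTHONPATH; Python's zip importer loads the module without unzipping.
sys.path.insert(0, zip_path)
import greet
print(greet.hello())

# Child process: pass the zip via PYTHONPATH in the environment, the way
# Spark's ProcessBuilder does for the Python worker.
env = dict(os.environ, PYTHONPATH=zip_path)
out = subprocess.run(
    [sys.executable, "-c", "import greet; print(greet.hello())"],
    env=env, capture_output=True, text=True,
)
print(out.stdout.strip())
```

Both prints show the message defined inside the zipped module, confirming that neither the parent nor the child process ever extracted the archive.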
List<String> commands = new java.util.ArrayList<String>();
commands.add("python");
commands.add("-m");
commands.add("test"); // test.py inside test.zip

ProcessBuilder pb = new ProcessBuilder();
pb.command(commands);

// Putting the zip on PYTHONPATH is enough; Python imports from it directly.
Map<String, String> workerEnv = pb.environment();
workerEnv.put("PYTHONPATH", "/path/to/test.zip");

Process worker = pb.start();