郭同jetNLP - 7 months ago
Java Question

How does Spark run Python on YARN? How does ProcessBuilder handle a zip file?

The steps are:

1. When Spark is built, all the Python files are packaged into pyspark.zip.

2. When a job is submitted to YARN with spark-submit, pyspark.zip is distributed to all the machines.

3. The Spark worker finds pyspark.zip and runs the Python files in it.

But the code here and here shows that it only puts the zip file's path into the ProcessBuilder's environment, and I haven't found any code that unzips pyspark.zip.

So I'm wondering: how does ProcessBuilder unzip pyspark.zip?
Or how does the Spark worker run the Python files in pyspark.zip?

Answer

In fact, if you type python -h, the output includes:

Other environment variables:
PYTHONPATH   : ':'-separated list of directories prefixed to the default module search path.  The result is sys.path.

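So ProcessBuilder never unzips anything. It only passes the zip file's path to the Python process through the PYTHONPATH environment variable.

Python can import from a zip file directly, so you don't need to unzip it: an entry on PYTHONPATH (and therefore on sys.path) may be a zip archive as well as a directory. For example: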

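import java.util.ArrayList;
import java.util.List;
import java.util.Map;

List<String> commands = new ArrayList<>();
commands.add("python");
commands.add("-m");
commands.add("test"); // runs test.py, which sits at the root of test.zip
ProcessBuilder pb = new ProcessBuilder();
pb.command(commands);
// Python searches each PYTHONPATH entry for modules; an entry may be a zip file
Map<String, String> workerEnv = pb.environment();
workerEnv.put("PYTHONPATH", "/path/to/test.zip");
Process worker = pb.start(); // throws IOException if "python" cannot be started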
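
To try this end to end, here is a minimal, self-contained sketch. It is only an illustration and not Spark's actual code: the class name ZipPythonPathDemo, the helper methods, and the module name hello_from_zip are all made up for this example, and it assumes an interpreter named "python" is on the PATH (on some systems it is "python3"). It writes a small zip containing one module, points PYTHONPATH at that zip, and lets Python import the module without anything being unzipped.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipPythonPathDemo {

    // Writes a zip archive holding a single module, <moduleName>.py, at the archive root.
    static void writeModuleZip(Path zip, String moduleName, String source) throws IOException {
        try (OutputStream out = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry(moduleName + ".py"));
            zos.write(source.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
    }

    // Builds (but does not start) a process that runs the module with `-m`.
    // The zip is never extracted; its path is only placed in PYTHONPATH.
    static ProcessBuilder buildWorker(String interpreter, Path zip, String moduleName) {
        ProcessBuilder pb = new ProcessBuilder(interpreter, "-m", moduleName);
        pb.environment().put("PYTHONPATH", zip.toAbsolutePath().toString());
        return pb;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Path zip = Files.createTempFile("demo", ".zip");
        // Hypothetical module name and body, chosen only for this demo.
        writeModuleZip(zip, "hello_from_zip", "print('hello from inside the zip')\n");
        // Assumption: the interpreter is called "python" on this machine.
        ProcessBuilder pb = buildWorker("python", zip, "hello_from_zip").inheritIO();
        try {
            int exit = pb.start().waitFor();
            System.out.println("exit code: " + exit);
        } catch (IOException e) {
            System.out.println("could not start interpreter: " + e.getMessage());
        } finally {
            Files.deleteIfExists(zip);
        }
    }
}
```

The module is deliberately not called test here, because a test.py on PYTHONPATH would shadow Python's own standard-library test package.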