ThatDataGuy - 1 month ago
Python Question

PySpark: Do Python processes on an executor node share broadcast variables in RAM?

I have a node with 24 cores and 124 GB of RAM in my Spark cluster. If I set spark.executor.memory to 4g and then broadcast a variable that takes 3.5 GB to store in RAM, will the cores collectively hold 24 copies of that variable, or just one copy?

I am using PySpark v1.6.2.
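For context, a minimal sketch of the kind of setup described above; the app name, lookup table, and enrichment function are hypothetical placeholders standing in for the real ~3.5 GB object:

```python
from pyspark import SparkConf, SparkContext

# Executor memory set as in the question; other settings left at defaults.
conf = (SparkConf()
        .setAppName("broadcast-memory-question")
        .set("spark.executor.memory", "4g"))
sc = SparkContext(conf=conf)

# Stand-in for the ~3.5 GB lookup table built on the driver, then broadcast.
big_lookup = {i: i * 2 for i in range(10)}
bc = sc.broadcast(big_lookup)

def enrich(key):
    # Each task reads the broadcast value via bc.value.
    return (key, bc.value.get(key))

result = sc.parallelize(range(10)).map(enrich).collect()
print(result)
```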

Answer

I believe that PySpark doesn't use any form of shared memory to share broadcast variables between the workers.

Broadcast variables are loaded in the main function of the worker, which is called only after forking from the daemon, so they are not accessible from the parent process's address space.
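A minimal sketch (not from the original post) that makes this observable: each task reports the PID of the Python worker that deserialised the broadcast value, and each distinct PID holds its own copy in RAM.

```python
import os
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-per-worker-copy")
bc = sc.broadcast(list(range(1000)))  # stand-in for a large object

def worker_pid(_):
    # bc.value is deserialised inside this forked worker process,
    # so every distinct PID seen here corresponds to a separate copy.
    _ = bc.value
    return os.getpid()

pids = set(sc.parallelize(range(100), 8).map(worker_pid).collect())
print(pids)  # typically several distinct PIDs, one per Python worker process
```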
