Ilya Smagin Ilya Smagin - 1 month ago 14
Scala Question

Apache Spark: what happens when one uses a host object value within a worker that has not been broadcasted?

Imagine a simple program like this:

def main(args: String[]):
val hostLocalValue = args(0).toInt
val someRdd = getSomeIntRdd
val mySum = someRdd
.map(x => if (x < 0) 1 else hostLocalValue)
.reduce(_ + _)
print(mySum)


The map function which is executed at remote worker uses host-local value without that being broadcasted. How does this work? If THAT works all the time, then what do we need broadcast() for?

Answer

In your example 'hostLocalValue' will be serialized and send to each worker nodes with 'map' closure. If you have 1000 partitions this variable will be distributed to workers 1000 times. Your variable is Int, so it's ok. But if you variable would be dictionary Map ~100mb, you'll have to send 100 gigs over network.

But if you'll wrap your dictionary in broadcast you have to ship it only once => Benefit!