Jon Jon - 3 months ago 34
Scala Question

Is it possible to execute a command on all workers within Apache Spark?

I have a situation where I want to execute a system process on each worker within Spark. I want this process to be run an each machine once. Specifically this process starts a daemon which is required to be running before the rest of my program executes. Ideally this should execute before I've read any data in.

I'm on Spark 2.0.2 and using dynamic allocation.


You may be able to achieve this with a combination of lazy val and Spark broadcast. It will be something like below. (Have not compiled below code, you may have to change few things)

object ProcessManager {
  lazy val start = // start your process here.

You can broadcast this object at the start of your application before you do any transformations.

val pm = sc.broadcast(ProcessManager)

Now, you can access this object inside your transformation like you do with any other broadcast variables and invoke the lazy val.

rdd.mapPartition(itr => {
  // Other stuff here.