Click Upvote - 29 days ago
Java Question

Spark scheduling / architecture confusion

I'm attempting to set up a Spark cluster using the standalone Spark cluster manager (not YARN or Mesos). I'm trying to understand how things need to be architected.

Here's my understanding:

  • One node needs to be set up as the Master

  • One or more nodes need to be set up as Workers

  • The application I write (in Java) will be passed the ip:port of the Master to create the Spark context

  • When I run any code in the Java app on the Spark context, e.g. a filter/collect, that code will automatically be run across the worker nodes.

My questions are:

  • Do I need to set up a separate server/node for running the driver, or can/should it be run from the master or one of the worker nodes?

  • If I wanted my filter/collect code to run at regular intervals, do I need to take care of the scheduling myself from within the driver?

  • Edit: It looks like the recommended way of submitting jobs is through a bash script? That seems like a manual process. How does this get handled in production?

  1. You can run your application from a non-worker node - this is called client mode. If you run your application inside some worker node, it's called cluster mode. Both are possible.
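For illustration, the two modes differ only in the --deploy-mode flag passed to spark-submit; the class name, jar path, and master URL below are placeholders for your own:

```shell
# Client mode: the driver runs on the machine where you invoke spark-submit
spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode client \
  my-app.jar

# Cluster mode: the driver is launched on one of the worker nodes
spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  my-app.jar
```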

  2. Please take a look at Spark Streaming - it seems it will fit your requirements. You can specify, for example, that data will be collected every hour and the computation will start then. You can also create a cron task that executes spark-submit.

  3. Yes, the recommended way is through the spark-submit script. You can, however, run this script from cron jobs, from Marathon, or from Oozie. It really depends on what you want to do.
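As a sketch of the cron approach (the schedule, paths, and class name are placeholder assumptions, not from your setup):

```shell
# crontab entry: submit the batch job at the top of every hour,
# appending driver output to a log file
0 * * * * /opt/spark/bin/spark-submit --class com.example.MyBatchJob --master spark://master-host:7077 /opt/jobs/my-batch-job.jar >> /var/log/my-batch-job.log 2>&1
```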

If you want more info, please write more about your use case and I'll try to update my answer with more precise information.

Update after comment: I recommend looking at Spark Streaming - it has a connector to Kafka, and you can write aggregations or custom processing, via foreachRDD, on the data received from specific topics. Pseudocode of the algorithm:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sparkConf, Seconds(2))
// kafkaParams: Map[String, String] and topics: Set[String] defined elsewhere
val directKafkaStream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
val topicFirst = directKafkaStream.filter(_._1 == "topic1")
val topic2 = directKafkaStream.filter(_._1 == "topic2")

topicFirst.foreachRDD { rdd =>
  // do some processing with the data collected in the specified time window
}

ssc.start()
ssc.awaitTermination()

About cron, you can invoke spark-submit with nohup. However, it's better to have one long-running job than many small ones if you must execute them at short intervals. It seems Spark Streaming will be good for you, though - then you'll have one long-running job. The mandatory Word Count example is here :)