Scala Question

How to override setup and cleanup methods in a Spark map function

Suppose there is the following MapReduce job:

Mapper:

setup() initializes some state

map() adds data to the state, no output

cleanup() outputs the state to the context

Reducer:

aggregate all states into one output (sketched below)
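
In Hadoop terms, the mapper described above looks roughly like the following minimal sketch (written in Scala against the Hadoop MapReduce API; the key/value types and the Long state are illustrative assumptions, not part of the original question):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class StatefulMapper extends Mapper[LongWritable, Text, Text, LongWritable] {
  // alias for the inner Context type of this concrete Mapper
  private type Ctx = Mapper[LongWritable, Text, Text, LongWritable]#Context

  private var state: Long = 0L

  override def setup(context: Ctx): Unit = {
    state = 0L // setup(): initialize some state
  }

  override def map(key: LongWritable, value: Text, context: Ctx): Unit = {
    state += value.getLength // map(): add data to the state, no output
  }

  override def cleanup(context: Ctx): Unit = {
    // cleanup(): output the accumulated state to the context
    context.write(new Text("state"), new LongWritable(state))
  }
}

The reducer would then aggregate the one value emitted per map task into a single output.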

How could such a job be implemented in Spark?

Additional question: how could such a job be implemented in Scalding?
I'm looking for an example which somehow performs the method overrides...

Answer

Spark's map doesn't provide an equivalent of Hadoop's setup and cleanup methods. It assumes that each call is independent and side-effect free.

The closest equivalent is to put the required logic inside mapPartitions or mapPartitionsWithIndex, following this simplified template:

rdd.mapPartitions { iter =>
  // initialize state
  val result = ??? // compute the result for iter
  // perform cleanup
  result // return the results as an Iterator[U]
}
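
For instance, here is a minimal, self-contained sketch of the pattern (the Long state and the summation are stand-ins for whatever the real job accumulates; the object name and local master are assumptions added for runnability):

import org.apache.spark.{SparkConf, SparkContext}

object SetupCleanupExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("setup-cleanup").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // setup + map + cleanup, once per partition
    val perPartitionState = rdd.mapPartitions { iter =>
      var state = 0L           // "setup()": initialize per-partition state
      iter.foreach(state += _) // "map()": fold each record into the state
      Iterator.single(state)   // "cleanup()": emit the state exactly once
    }

    // "reducer": aggregate all per-partition states into one output
    val total = perPartitionState.reduce(_ + _)
    println(total) // 5050

    sc.stop()
  }
}

Note that this version consumes iter eagerly with foreach. If you instead return a lazy iterator that wraps iter, any cleanup (closing a connection, for example) must be deferred until that iterator has been fully consumed.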