Julias - 11 months ago
Scala Question

How to override setup and cleanup methods in spark map function

Suppose there is the following MapReduce job:


setup() initializes some state

map() adds data to the state, no output

cleanup() outputs the state to the context


aggregate all states into one output

How could such a job be implemented in Spark?
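For reference, the lifecycle described above can be sketched as a plain Scala class (the names `StatefulMapper`, `map`, and `cleanup` are illustrative, mirroring the Hadoop-style hooks rather than any actual API):

```scala
// Hypothetical sketch of the Hadoop-style lifecycle the question describes.
class StatefulMapper {
  // setup(): initialize some state
  private var state = 0L

  // map(): add data to the state, produce no output
  def map(record: Int): Unit = {
    state += record
  }

  // cleanup(): output the accumulated state
  def cleanup(): Long = state
}
```

Each mapper instance accumulates across all records it sees and emits only once, at the end; the framework then aggregates the per-mapper states into one output.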

Additional question: how could such a job be implemented in Scalding?
I'm looking for an example which somehow overrides these methods...

Answer Source

Spark's map doesn't provide an equivalent of Hadoop's setup and cleanup. It assumes that each call is independent and side-effect free.

The closest equivalent is to put the required logic inside mapPartitions or mapPartitionsWithIndex, following this simplified template:

rdd.mapPartitions { iter =>
   ... // initialize state
   val result = ??? // compute result for iter
   ... // perform cleanup
   result // return results as an Iterator[U]
}
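As a concrete sketch of that template, here is a partition function that initializes per-partition state, folds every record into it, and emits the final state exactly once, mirroring setup/map/cleanup. The function name `sumPartition` and the summing logic are illustrative; since it depends only on `Iterator`, it can be exercised without a cluster and then passed to `mapPartitions`:

```scala
// Partition function mirroring setup()/map()/cleanup():
// it receives all records of one partition as an Iterator
// and returns an Iterator of results for that partition.
def sumPartition(iter: Iterator[Int]): Iterator[Long] = {
  // setup(): initialize per-partition state
  var state = 0L
  // map(): fold every record into the state, emitting nothing
  iter.foreach { x => state += x }
  // cleanup(): emit the accumulated state exactly once
  Iterator.single(state)
}

// With Spark (assuming a SparkContext `sc` is available):
//   val perPartition = sc.parallelize(1 to 100, 4).mapPartitions(sumPartition)
//   val total = perPartition.reduce(_ + _) // aggregate all states into one output
```

The final `reduce` step corresponds to "aggregate all states into one output" from the question: each partition contributes one state value, and the driver combines them.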