Nicholas White Nicholas White - 3 months ago 47
Scala Question

Apache Spark: map vs mapPartitions?

What's the difference between an RDD's

map
and
mapPartitions
method? And does
flatMap
behave like
map
or like
mapPartitions
? Thanks.

(edit)
i.e. what is the difference (either semantically or in terms of execution) between

def map[A, B](rdd: RDD[A], fn: (A => B))(implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
rdd.mapPartitions({ iter: Iterator[A] => for (i <- iter) yield fn(i) }, preservesPartitioning = true)
}


And:

def map[A, B](rdd: RDD[A], fn: (A => B))(implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
rdd.map(fn)
}

Answer

What's the difference between an RDD's map and mapPartitions method?

The method map converts each element of the source RDD into a single element of the result RDD by applying a function. mapPartitions converts each partition of the source RDD into multiple elements of the result (possibly none).

And does flatMap behave like map or like mapPartitions?

Neither, flatMap works on a single element (as map) and produces multiple elements of the result (as mapPartitions).

Comments