jyshin jyshin - 9 months ago 56
Scala Question

Is Spark's RDD.fold method buggy?

I found that Spark's RDD.fold and Scala's List.fold behave differently with the same input.

Scala 2.11.8

List(1, 2, 3, 4).fold(1)(_ + _) // res0: Int = 11

I think this is the correct output because 1 + (1 + 2 + 3 + 4) equals 11. But Spark's RDD.fold looks buggy.

Spark 2.0.1 (not clustered)

sc.parallelize(List(1, 2, 3, 4)).fold(1)(_ + _) // res0: Int = 15

Although an RDD is not a simple collection, this result does not make sense to me. Is this a known bug or the expected result?

Answer Source

It is not buggy; you're just not using it the right way. zeroValue should be neutral, meaning it has to satisfy the following condition:

op(x, zeroValue) === op(zeroValue, x) === x 

If op is + then the right choice is 0.
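To make the condition concrete, here is a minimal sketch in plain Scala (no Spark needed) that tests it on a handful of sample values. The object `NeutralCheck` and the helper `isNeutral` are hypothetical names introduced only for this illustration, and checking a few samples is of course not a proof.

```scala
object NeutralCheck {
  // Hypothetical helper: checks op(x, zeroValue) == op(zeroValue, x) == x
  // for every x in a small sample. Passing the check is evidence, not a proof.
  def isNeutral[A](zeroValue: A, samples: Seq[A])(op: (A, A) => A): Boolean =
    samples.forall(x => op(x, zeroValue) == x && op(zeroValue, x) == x)

  def main(args: Array[String]): Unit = {
    val ints = Seq(-3, 0, 1, 7)
    println(isNeutral(0, ints)(_ + _)) // true:  0 is neutral for +
    println(isNeutral(1, ints)(_ + _)) // false: 1 is not neutral for +
    println(isNeutral(1, ints)(_ * _)) // true:  1 is neutral for *
  }
}
```

So for + the zero value used in the question, 1, fails the condition, while 0 passes.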

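Why a restriction like this? If fold is executed in parallel, each chunk (partition) has to initialize its own zeroValue, and the partial results are then merged starting from zeroValue once more. With zeroValue = 1, the extra 1 is therefore added once per partition and once in the final merge; the result 15 is consistent with four partitions: 10 + 4 * 1 + 1 = 15. More formally, you can think of a Monoid, where: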

  • op is equivalent to •
  • zeroValue is equivalent to the identity element.
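
The effect of each chunk initializing its own zero value can be imitated on ordinary Scala collections. The sketch below is not Spark's implementation; `FoldSketch` and `partitionedFold` are hypothetical names, and the way the data is split into partitions is an assumption chosen for illustration.

```scala
object FoldSketch {
  // Hypothetical helper: imitates a partitioned fold on plain Scala collections.
  def partitionedFold[A](partitions: Seq[Seq[A]], zeroValue: A)(op: (A, A) => A): A = {
    // Every partition starts its own fold from zeroValue.
    val partials = partitions.map(_.fold(zeroValue)(op))
    // The partial results are merged, again starting from zeroValue.
    partials.fold(zeroValue)(op)
  }

  def main(args: Array[String]): Unit = {
    // Assumed split: one element per partition, i.e. four partitions.
    val fourPartitions = Seq(Seq(1), Seq(2), Seq(3), Seq(4))
    // Assumed split: two partitions of two elements each.
    val twoPartitions = Seq(Seq(1, 2), Seq(3, 4))

    println(partitionedFold(fourPartitions, 1)(_ + _)) // 15: zero added 4 + 1 times
    println(partitionedFold(twoPartitions, 1)(_ + _))  // 13: zero added 2 + 1 times
    println(partitionedFold(fourPartitions, 0)(_ + _)) // 10
    println(partitionedFold(twoPartitions, 0)(_ + _))  // 10
  }
}
```

With a non-neutral zero value the result depends on how the data happens to be partitioned; with the identity element 0 it is 10 regardless of the split.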