
Is spark RDD.fold method buggy?

I found that Spark's RDD.fold and Scala's List.fold behave differently on the same input.

Scala 2.11.8

List(1, 2, 3, 4).fold(1)(_ + _) // res0: Int = 11

I think this is the correct output, because 1 + (1 + 2 + 3 + 4) equals 11. But Spark's RDD.fold looks buggy.

Spark 2.0.1(not clustered)

sc.parallelize(List(1, 2, 3, 4)).fold(1)(_ + _) // res0: Int = 15

Although an RDD is not a simple collection, this result does not make sense to me. Is this a known bug or the expected behavior?


It is not buggy; you're just not using it the right way. The zeroValue should be neutral, meaning it has to satisfy the following condition:

op(x, zeroValue) === op(zeroValue, x) === x 

If op is +, then the right choice is 0.
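A quick way to sanity-check a candidate zeroValue is to verify the identity law directly on a few sample values. This is plain Scala (no Spark needed); `isNeutral` is a hypothetical helper, not part of any API:

```scala
// Checks op(zero, x) == op(x, zero) == x for every sample value,
// i.e. that `zero` is an identity element for `op`
def isNeutral(zero: Int, op: (Int, Int) => Int, samples: Seq[Int]): Boolean =
  samples.forall(x => op(zero, x) == x && op(x, zero) == x)

isNeutral(0, _ + _, Seq(1, 2, 3, 4)) // true: 0 is the identity for +
isNeutral(1, _ + _, Seq(1, 2, 3, 4)) // false: 1 shifts every result by one
```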

Why a restriction like this? If fold is executed in parallel, each chunk has to initialize its own zeroValue. More formally, you can think of this in terms of a Monoid, where:

  • op is equivalent to •
  • zeroValue is equivalent to the identity element.
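This also explains the 15 in the question. A minimal sketch of the distributed execution model in plain Scala (the function name and the 4-element partitioning are assumptions for illustration): each partition folds its elements starting from zeroValue, and the driver then folds the per-partition results starting from zeroValue once more, so a non-neutral zeroValue is added numPartitions + 1 times.

```scala
// Simulates a distributed fold: fold each partition with `zero`,
// then fold the per-partition results with `zero` again
def distributedFold(partitions: List[List[Int]], zero: Int)(op: (Int, Int) => Int): Int =
  partitions.map(_.fold(zero)(op)).fold(zero)(op)

val parts = List(List(1), List(2), List(3), List(4)) // one element per "partition"

distributedFold(parts, 1)(_ + _) // 15: the extra 1 is applied 4 + 1 times
distributedFold(parts, 0)(_ + _) // 10: 0 is the identity for +, so partitioning is invisible
```

With zeroValue = 1 and 4 partitions this gives (1+2+3+4) + 1 × (4+1) = 15, matching the Spark result; with the neutral 0, the answer agrees with List.fold regardless of how the data is partitioned.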