Carlos Bribiescas - 9 months ago
Scala Question

Does Spark maintain Hash Functions across its cluster?

The general contract for hashCode says:

"This integer need not remain consistent from one execution of an application to another execution of the same application."


So for something like Spark, which runs a separate JVM per executor, does it do anything to ensure that hash codes are consistent across the cluster?

In my experience I have only used keys with deterministic hash codes, so it hasn't been a problem.
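For context, here is a quick local sketch (HashStability, Key, and Box are made-up names) showing which kinds of hash codes are deterministic. Run it in two separate JVMs: the first two lines print the same value each time, while the third generally does not.

    // Which hash codes are stable across JVM runs?
    object HashStability {
      def main(args: Array[String]): Unit = {
        // String.hashCode is specified by the JLS, so it is the same in every JVM.
        println(s"string:      ${"spark".hashCode}")

        // Case classes hash structurally (MurmurHash3 over the fields),
        // so equal values hash identically in every JVM.
        case class Key(id: Int, name: String)
        println(s"case class:  ${Key(1, "a").hashCode}")

        // A plain class falls back to the identity hash code, which the
        // contract explicitly allows to differ between executions.
        class Box(val id: Int)
        println(s"plain class: ${new Box(1).hashCode}")
      }
    }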

Answer

"In my experience I have only used keys with deterministic hash codes, so it hasn't been a problem."

That is indeed the way to go: Spark cannot compensate for keys whose hash codes are non-deterministic across JVMs.
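To illustrate the safe pattern, here is a minimal sketch (UserKey and the data are hypothetical). A Scala case class hashes structurally, so the same key value hashes identically on every executor, and the shuffle's HashPartitioner sends equal keys to the same partition.

    import org.apache.spark.{SparkConf, SparkContext}

    // A case class key hashes over its fields, not its memory address,
    // so UserKey(1L, "us") has the same hashCode in every JVM.
    case class UserKey(id: Long, region: String)

    object DeterministicKeys {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("deterministic-keys").setMaster("local[*]")
        val sc = new SparkContext(conf)

        val events = sc.parallelize(Seq(
          (UserKey(1L, "us"), 10),
          (UserKey(1L, "us"), 5),
          (UserKey(2L, "eu"), 7)
        ))

        // The shuffle routes each record by key.hashCode, so identical
        // keys created in different JVMs meet in the same partition.
        events.reduceByKey(_ + _).collect().foreach(println)

        sc.stop()
      }
    }

The same reasoning applies to any hash-based shuffle operation: groupByKey, join, and distinct all rely on equal keys hashing to the same partition.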

Usage of Java enums as keys is a particularly notorious example of how this can go wrong; see http://dev.bizo.com/2014/02/beware-enums-in-spark.html. Quoting that post:

... the hashCode method on Java's enum type is based on the memory address of the object. So while yes, we're guaranteed that the same enum value have a stable hashCode inside a particular JVM (since the enum will be a static object) - we don't have this guarantee when you try to compare hashCodes of Java enums with identical values living in different JVMs
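One common workaround, sketched below with java.util.concurrent.TimeUnit standing in for an application enum (EnumKeys and the data are made up), is to key by the enum's name rather than the enum object itself, since String.hashCode is specified by the JLS and therefore identical in every JVM.

    import java.util.concurrent.TimeUnit
    import org.apache.spark.{SparkConf, SparkContext}

    object EnumKeys {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("enum-keys").setMaster("local[*]")
        val sc = new SparkContext(conf)

        val timings = sc.parallelize(Seq(
          (TimeUnit.SECONDS, 1L),
          (TimeUnit.SECONDS, 2L),
          (TimeUnit.MINUTES, 3L)
        ))

        // Risky: Enum.hashCode is the identity hash code, so equal enum
        // keys living in different executor JVMs may hash differently:
        // timings.reduceByKey(_ + _)

        // Safer: key by the enum's name, a String with a JLS-specified,
        // JVM-independent hashCode.
        val safe = timings
          .map { case (unit, n) => (unit.name(), n) }
          .reduceByKey(_ + _)

        safe.collect().foreach(println)
        sc.stop()
      }
    }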
