SiLaf - 5 months ago
Java Question

Spark: keyBy() vs mapToPair()

In Java Spark, I can use either keyBy() or mapToPair() to create a key for the elements of a JavaRDD. Using keyBy() makes my intention clearer and takes a function with a bit less code (the function returns just the key rather than a tuple). However, is there any performance improvement in using keyBy() over mapToPair()? Thanks


You can browse the difference in the source:

def mapToPair[K2, V2](f: PairFunction[T, K2, V2]): JavaPairRDD[K2, V2] = {
  def cm: ClassTag[(K2, V2)] = implicitly[ClassTag[(K2, V2)]]
  new JavaPairRDD(rdd.map[(K2, V2)](f)(cm))(fakeClassTag[K2], fakeClassTag[V2])
}


def keyBy[U](f: JFunction[T, U]): JavaPairRDD[U, T] = {
  implicit val ctag: ClassTag[U] = fakeClassTag
  JavaPairRDD.fromRDD(rdd.keyBy(f))
}

Which calls:

def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
    val cleanedF = sc.clean(f)
    map(x => (cleanedF(x), x))
}

Both end up calling map on the underlying RDD and produce a new RDD of pairs; keyBy just builds the (key, element) tuple for you. I see no significant performance difference between the two.
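
To make the equivalence concrete, here is a small plain-Java sketch that models both operations as a single map over a list. It does not use Spark: the class KeyByVsMapToPair and its two static methods are hypothetical stand-ins written for illustration, and Map.Entry plays the role of Tuple2. With a real JavaRDD you would compare rdd.keyBy(f) against rdd.mapToPair(x -> new Tuple2<>(f.call(x), x)).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class KeyByVsMapToPair {

    // Hypothetical stand-in for mapToPair: the caller's function builds the whole (key, value) pair.
    public static <T, K, V> List<Map.Entry<K, V>> mapToPair(
            List<T> data, Function<T, Map.Entry<K, V>> f) {
        return data.stream().map(f).collect(Collectors.toList());
    }

    // Hypothetical stand-in for keyBy: the caller's function returns only the key,
    // and the element itself becomes the value, mirroring map(x => (f(x), x)).
    public static <T, K> List<Map.Entry<K, T>> keyBy(List<T> data, Function<T, K> f) {
        return mapToPair(data, x -> new SimpleEntry<K, T>(f.apply(x), x));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "rdd", "pair");

        // Key each word by its length, once with each style.
        List<Map.Entry<Integer, String>> viaKeyBy = keyBy(words, String::length);
        List<Map.Entry<Integer, String>> viaMapToPair =
                mapToPair(words, w -> new SimpleEntry<Integer, String>(w.length(), w));

        // Same pairs in the same order, produced by the same single map pass.
        System.out.println(viaKeyBy.equals(viaMapToPair));
    }
}
```

In this model keyBy is just mapToPair with the tuple construction filled in, which is why the choice between them comes down to readability rather than speed.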