HackerDuck HackerDuck - 1 year ago 55
Scala Question

Efficient grouping of data in Spark

I need to perform simple grouping of data in Spark (Scala). In particular, this is my initial data:

1, a, X
1, b, Y
2, a, Y
1, a, Y

val seqs = Seq((1, "a", "X"),(1, "b", "Y"),(2, "a", "Y"),(1, "a", "Y"))

I need to group it by the first key as follows:

1, (a, X), (b, Y), (a, Y)
2, (a, Y)

My initial idia was to use
, but I read that this operation is very expensive and requires a complete reshuffle of all data.

So, what is the less expensive option to perform grouping? A concrete example would be appreciated.

Answer Source

you could potentially do something like this:

  val rdd = sc.parallelize(List((1, "a", "X"),(1, "b", "Y"),(2, "a", "Y"),(1, "a", "Y")))
  val mapping = rdd.map(x=>(x._1,List((x._2,x._3))))
  val result = mapping.reduceByKey((x,y) => (x ++ y)) 

This uses the reduceByKey, which is much more efficient than groupByKey, but the problem with all reduce process, you must end up with 1 key value pair per group. So in this case, you need to explicitly convert each of your values into Lists, so the reduce process can then merge them.

You may also consider looking at combineByKey, which uses an internal reduce process