HackerDuck - 7 months ago 28

Scala Question

I need to perform simple grouping of data in Spark (Scala). In particular, this is my initial data:

`1, a, X`

1, b, Y

2, a, Y

1, a, Y

val seqs = Seq((1, "a", "X"),(1, "b", "Y"),(2, "a", "Y"),(1, "a", "Y"))

I need to group it by the first key as follows:

`1, (a, X), (b, Y), (a, Y)`

2, (a, Y)

My initial idia was to use

`DataFrame`

`groupBy`

So, what is the less expensive option to perform grouping? A concrete example would be appreciated.

Answer

you could potentially do something like this:

```
val rdd = sc.parallelize(List((1, "a", "X"),(1, "b", "Y"),(2, "a", "Y"),(1, "a", "Y")))
val mapping = rdd.map(x=>(x._1,List((x._2,x._3))))
val result = mapping.reduceByKey((x,y) => (x ++ y))
```

This uses the reduceByKey, which is much more efficient than groupByKey, but the problem with all reduce process, you must end up with 1 key value pair per group. So in this case, you need to explicitly convert each of your values into Lists, so the reduce process can then merge them.

You may also consider looking at combineByKey, which uses an internal reduce process