iamseiko - 1 year ago 67
Scala Question

# Aggregating arrays in an RDD by index

My data is in an RDD in which each element is an array of arrays. In each inner array, the first element is the key, and the two elements after it are the values associated with that key. How can I aggregate these inner arrays by their first element?

This is a sample input:

``````
Array[Array[Any]] = Array(Array(490, [490], 23225), Array(64, [64], 48262), Array(64, [64,11], 30677), Array(64, [64,11,6], 29865), Array(64, [64,3], 21175), Array(64, [64,6], 39697), Array(6, [6], 601374), Array(77, [77], 43454), Array(77, [77,11], 30409), Array(77, [77,11,6], 29830), Array(77, [77,6], 37654), Array(3, [3], 450031), Array(3, [3,6], 265180), Array(69, [69], 22631), Array(69, [69,6], 20439), Array(11, [11], 364065), Array(11, [11,3], 161286), Array(11, [11,3,6], 143682), Array(11, [11,6], 324013), Array(90, [90], 22062), Array(90, [90,6], 21288), Array(2, [2], 42927), Array(2, [2,11], 20826), Array(2, [2,6], 29619), Array(215, [215], 21592), Array(138, [138], 35127), Array(138, [138,11], 21566), Array(138, [138,11,6], 21008), Array(138, [138,6], 28750), Array(1, [...
``````

I want all of the arrays that have key 490 to be grouped together, and those that have key 64 to be together, and so forth.

You can use `groupBy`, keyed on the first element of each inner array:

`arr.groupBy(_.head)` or, in long form, `arr.groupBy(innerArr => innerArr.head)`

``````
val arr = Array(Array(400, "sad", "sd"), Array(300, "aa", "sd"), Array(400, "dsa", "asd"))
arr.groupBy(_.head)
// res0: Map[Any, Array[Array[Any]]] = Map(
//   400 -> Array(Array(400, sad, sd), Array(400, dsa, asd)),
//   300 -> Array(Array(300, aa, sd))
// )
``````

If you don't want the key to remain in each value array, you can use `mapValues` to drop it with `tail`:

``````
arr.groupBy(_.head)
  .mapValues(_.map(_.tail))
// res1: Map[Any, Array[Array[Any]]] = Map(
//   400 -> Array(Array(sad, sd), Array(dsa, asd)),
//   300 -> Array(Array(aa, sd))
// )
``````
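One caveat: since Scala 2.13, `mapValues` on a `Map` is deprecated and returns a lazy `MapView` rather than a strict `Map`, so the result type shown above applies to earlier versions. As a sketch of a version-independent alternative, assuming the same example array, a plain `map` over the `(key, group)` pairs produces the same result; the object and value names are hypothetical.

```scala
object StripKeyExample {
  val arr: Array[Array[Any]] = Array(
    Array(400, "sad", "sd"),
    Array(300, "aa", "sd"),
    Array(400, "dsa", "asd")
  )

  // Group by the first element, then drop that element from each inner array.
  // Mapping over the (key, group) pairs builds a strict Map.
  val stripped: Map[Any, Array[Array[Any]]] =
    arr.groupBy(_.head).map { case (key, group) => key -> group.map(_.tail) }

  def main(args: Array[String]): Unit =
    stripped.foreach { case (key, group) =>
      println(s"$key -> ${group.map(_.mkString("[", ", ", "]")).mkString(" ")}")
    }
}
```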