iamseiko - 1 month ago
Scala Question

Aggregating arrays in an RDD by index

I have data in a specific format, where each element in the RDD is an array of arrays. The first element of each inner array is the key, and the two elements after it are the values associated with that key. How can I aggregate these values by the first element of each array?

This is a sample input:

Array[Array[Any]] = Array(Array(490, [490], 23225), Array(64, [64], 48262), Array(64, [64,11], 30677), Array(64, [64,11,6], 29865), Array(64, [64,3], 21175), Array(64, [64,6], 39697), Array(6, [6], 601374), Array(77, [77], 43454), Array(77, [77,11], 30409), Array(77, [77,11,6], 29830), Array(77, [77,6], 37654), Array(3, [3], 450031), Array(3, [3,6], 265180), Array(69, [69], 22631), Array(69, [69,6], 20439), Array(11, [11], 364065), Array(11, [11,3], 161286), Array(11, [11,3,6], 143682), Array(11, [11,6], 324013), Array(90, [90], 22062), Array(90, [90,6], 21288), Array(2, [2], 42927), Array(2, [2,11], 20826), Array(2, [2,6], 29619), Array(215, [215], 21592), Array(138, [138], 35127), Array(138, [138,11], 21566), Array(138, [138,11,6], 21008), Array(138, [138,6], 28750), Array(1, [...


I want all of the arrays that have key 490 to be grouped together, and those that have key 64 to be together, and so forth.

Answer

You can use the groupBy method:

arr.groupBy(_.head), or in long form, arr.groupBy(innerArr => innerArr.head)

Array(Array(400, "sad", "sd"), Array(300, "aa", "sd"), Array(400, "dsa", "asd")) 
    .groupBy(_.head)
res0: Map[Any, Array[Array[Any]]] = Map(
  400 -> Array(Array(400, sad, sd), Array(400, dsa, asd)),
  300 -> Array(Array(300, aa, sd))
)

If you don't want the key to remain in each grouped array, you can use mapValues to drop it from every array, like so:

arr.groupBy(_.head)
   .mapValues(_.map(_.tail))
res1: Map[Any, Array[Array[Any]]] = Map(
  400 -> Array(Array(sad, sd), Array(dsa, asd)), 
  300 -> Array(Array(aa, sd))
)
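
As a hedged aside: on Scala 2.13, calling mapValues directly on a Map is deprecated and returns a lazy view rather than a strict Map. The sketch below shows one way to get the same grouping with a plain map over the key/value pairs, which should behave the same on 2.12 and 2.13. The object name GroupByKeyExample and the sample data are illustrative assumptions, and it works on a local Array rather than an RDD.

```scala
object GroupByKeyExample {
  // Hypothetical sample data in the same shape as the example above:
  // the first element of each inner array is the key.
  val arr: Array[Array[Any]] = Array(
    Array(400, "sad", "sd"),
    Array(300, "aa", "sd"),
    Array(400, "dsa", "asd")
  )

  // Group the inner arrays by their first element, then drop that
  // element from every array in each group.
  val grouped: Map[Any, Array[Array[Any]]] =
    arr.groupBy(_.head).map { case (key, rows) => key -> rows.map(_.tail) }

  def main(args: Array[String]): Unit =
    // Map iteration order is unspecified, so keys may print in any order.
    grouped.foreach { case (key, rows) =>
      println(s"$key -> ${rows.map(_.mkString("[", ", ", "]")).mkString(", ")}")
    }
}
```

Using map with a pattern match keeps the result a strict Map, at the cost of being slightly more verbose than mapValues.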