GameOfThrows GameOfThrows - 1 month ago 14
Scala Question

scala combinations function hitting GC overhead

I have the following process which takes a list of Strings and generates combinations of it:

val a = List(("a","a"),("a","b"),("a","c"),("b","a"),("b","b"),("b","c"),("c","a"),("c","b"),("c","c"));


and I am trying to generate a list of combinations of 3 (because 3 is the number of distinct letters in the set) where each member left is only mapped to 1 distinct member on the right and vise-versa.

So for example, the output I expect is something like:

List(("a","a"),("b","b"),("c","c"))


but it cannot be something like:

List (("a","a"),("b","a"),("a","c"))


so I've written the following:

val res = a
.combinations(3)
.toList
.filter(x =>
x.map(y => y._1).distinct.size == 3
&& x.map(y => y._2).distinct.size == 3
)


which generates the correct set of answers:

List((a,a), (b,b), (c,c))
List((a,a), (b,c), (c,b))
List((a,b), (b,a), (c,c))
List((a,b), (b,c), (c,a))
List((a,c), (b,a), (c,b))
List((a,c), (b,b), (c,a))


but when I increase the size of a along with the number of combinations, I am hitting GC overhead. I was wondering if there is a way to do what I want without using the combination function or in a way that is more efficient? I am using Spark, so I could use any Spark function on this as well - though I don't think there is any.

Answer

Well, indeed Spark doesn't have a combinations function, but you can mimic it using consecutive calls to cartesian. It might not be too efficient in terms of performance, but it should prevent the memory issues you've come across and resolve the need to collect (which has its own performance cost):

val values: RDD[(String, String)] = sc.parallelize(a)
val combinationSize = 3 // can be increased

// mimic Scala's "combination" by repeating RDD.cartesian N times:
val combinations: RDD[Set[(String, String)]] = (1 until combinationSize)
  .foldLeft(values.map(Set(_))) {
    case (rdd, index) => rdd.cartesian(values).map { case (set, t2) => set + t2 }.distinct
  }

// removing "illegal" combinations - since we're using sets we don't need to call "distinct": 
val res = combinations
  .filter(_.map(_._1).size == combinationSize)
  .filter(_.map(_._2).size == combinationSize)