GameOfThrows - 1 year ago 104
Scala Question

# scala combinations function hitting GC overhead

I have the following process which takes a list of Strings and generates combinations of it:

``````val a = List(("a","a"),("a","b"),("a","c"),("b","a"),("b","b"),("b","c"),("c","a"),("c","b"),("c","c"));
``````

and I am trying to generate a list of combinations of 3 (because 3 is the number of distinct letters in the set) where each member left is only mapped to 1 distinct member on the right and vise-versa.

So for example, the output I expect is something like:

``````List(("a","a"),("b","b"),("c","c"))
``````

but it cannot be something like:

``````List (("a","a"),("b","a"),("a","c"))
``````

so I've written the following:

``````val res = a
.combinations(3)
.toList
.filter(x =>
x.map(y => y._1).distinct.size == 3
&&  x.map(y => y._2).distinct.size == 3
)
``````

which generates the correct set of answers:

``````List((a,a), (b,b), (c,c))
List((a,a), (b,c), (c,b))
List((a,b), (b,a), (c,c))
List((a,b), (b,c), (c,a))
List((a,c), (b,a), (c,b))
List((a,c), (b,b), (c,a))
``````

but when I increase the size of a along with the number of combinations, I am hitting GC overhead. I was wondering if there is a way to do what I want without using the combination function or in a way that is more efficient? I am using Spark, so I could use any Spark function on this as well - though I don't think there is any.

Well, indeed Spark doesn't have a `combinations` function, but you can mimic it using consecutive calls to `cartesian`. It might not be too efficient in terms of performance, but it should prevent the memory issues you've come across and resolve the need to `collect` (which has its own performance cost):

``````val values: RDD[(String, String)] = sc.parallelize(a)
val combinationSize = 3 // can be increased

// mimic Scala's "combination" by repeating RDD.cartesian N times:
val combinations: RDD[Set[(String, String)]] = (1 until combinationSize)
.foldLeft(values.map(Set(_))) {
case (rdd, index) => rdd.cartesian(values).map { case (set, t2) => set + t2 }.distinct
}

// removing "illegal" combinations - since we're using sets we don't need to call "distinct":
val res = combinations
.filter(_.map(_._1).size == combinationSize)
.filter(_.map(_._2).size == combinationSize)
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download