GameOfThrows - 5 months ago 36

Scala Question

I have the following process which takes a list of Strings and generates combinations of it:

`val a = List(("a","a"),("a","b"),("a","c"),("b","a"),("b","b"),("b","c"),("c","a"),("c","b"),("c","c"));`

and I am trying to generate a list of combinations of 3 (because 3 is the number of distinct letters in the set) where each member left is only mapped to 1 distinct member on the right and vise-versa.

So for example, the output I expect is something like:

`List(("a","a"),("b","b"),("c","c"))`

but it cannot be something like:

`List (("a","a"),("b","a"),("a","c"))`

so I've written the following:

`val res = a`

.combinations(3)

.toList

.filter(x =>

x.map(y => y._1).distinct.size == 3

&& x.map(y => y._2).distinct.size == 3

)

which generates the correct set of answers:

`List((a,a), (b,b), (c,c))`

List((a,a), (b,c), (c,b))

List((a,b), (b,a), (c,c))

List((a,b), (b,c), (c,a))

List((a,c), (b,a), (c,b))

List((a,c), (b,b), (c,a))

but when I increase the size of a along with the number of combinations, I am hitting GC overhead. I was wondering if there is a way to do what I want without using the combination function or in a way that is more efficient? I am using Spark, so I could use any Spark function on this as well - though I don't think there is any.

Answer

Well, indeed Spark doesn't have a `combinations`

function, but you can mimic it using consecutive calls to `cartesian`

. It might not be too efficient in terms of performance, but it should prevent the memory issues you've come across and resolve the need to `collect`

(which has its own performance cost):

```
val values: RDD[(String, String)] = sc.parallelize(a)
val combinationSize = 3 // can be increased
// mimic Scala's "combination" by repeating RDD.cartesian N times:
val combinations: RDD[Set[(String, String)]] = (1 until combinationSize)
.foldLeft(values.map(Set(_))) {
case (rdd, index) => rdd.cartesian(values).map { case (set, t2) => set + t2 }.distinct
}
// removing "illegal" combinations - since we're using sets we don't need to call "distinct":
val res = combinations
.filter(_.map(_._1).size == combinationSize)
.filter(_.map(_._2).size == combinationSize)
```