Fabio - 4 years ago 88
Scala Question

# Scala collect function

Let's say I want to print `duplicates` in a list with their `count`. So I have `3 options` as shown below:

``````  def dups(dup:List[Int]) = {
//1)
println(dup.groupBy(identity).collect { case (x,ys) if ys.lengthCompare(1) > 0 => (x,ys.size) }.toSeq)
//2)
println(dup.groupBy(identity).collect { case (x, List(_, _, _*)) => x }.map(x => (x, dup.count(y => x == y))))
//3)
println(dup.distinct.map((a:Int) => (a, dup.count((b:Int) => a == b )) ).filter( (pair: (Int,Int) ) => { pair._2 > 1 } ))

}
``````

Questions:

-> For `option 2`, is there any way to name the list parameter so that it can be used to append the size of the list, just like I did in `option 1` using `ys.size`?

-> For `option 1`, is there any way to avoid the last call to `toSeq` and return a `List` directly?

-> Which one of the 3 choices is more efficient, i.e. uses the fewest loops?

As an example input: List(1,1,1,2,3,4,5,5,6,100,101,101,102)
Should print: List((1,3), (5,2), (101,2))

Based on @lutzh's answer below, the best way would be to do the following:

``````import scala.collection.breakOut // required for the (breakOut) argument below
val list: List[(Int, Int)] = dup.groupBy(identity).collect({ case (x, ys @ List(_, _, _*)) => (x, ys.size) })(breakOut)
val list2: List[(Int, Int)] = dup.groupBy(identity).collect { case (x, ys) if ys.lengthCompare(1) > 0 => (x, ys.size) }(breakOut)
``````

For option 1 is there any way to avoid the last call to toSeq to return a List?

`collect` takes an implicit `CanBuildFrom`, so if you assign the result to a value of the desired type, you can use `breakOut`:

``````import collection.breakOut
val dups: List[(Int,Int)] =
dup
.groupBy(identity)
.collect({ case (x,ys) if ys.size > 1 => (x,ys.size)} )(breakOut)
``````

`collect` will create a new collection (just like `map`), using a `Builder`. Usually the return type is determined by the origin type. With breakOut you basically ignore the origin type and look for a builder for the result type. So when `collect` creates the resulting collection, it will already create the "right" type, and you don't have to traverse the result again to convert it.
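One caveat for readers on newer compilers: `breakOut` was removed in Scala 2.13. A minimal sketch of the same idea (produce the `List` directly instead of building a `Map` and converting it) that should compile on both 2.12 and 2.13 goes through an `Iterator`. The object and method names here are hypothetical, not part of the original answer.

```scala
object DupsViaIterator {
  // Hypothetical helper: returns each duplicated value with its count.
  def dups(dup: List[Int]): List[(Int, Int)] =
    dup
      .groupBy(identity) // Map[Int, List[Int]]
      .iterator          // iterate the entries lazily; collect builds no intermediate Map
      .collect { case (x, ys) if ys.lengthCompare(1) > 0 => (x, ys.size) }
      .toList            // the result is materialised once, as a List
}
```

Note that the iteration order of the `Map` returned by `groupBy` is not guaranteed, so sort the result (for example with `.sorted`) if a stable order matters.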

For option 2, is there any way to name the list parameter so that it can be used to append the size of the list just like I did in option 1 using ys.size?

Yes, you can bind it to a variable with `@`:

``````val dups: List[(Int,Int)] =
dup
.groupBy(identity)
.collect({ case (x, ys @ List(_, _, _*)) => (x, ys.size) } )(breakOut)
``````
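To see the `@` binder in isolation, here is a small sketch outside of `collect`. The pattern on the right of `@` decides whether the case matches, and the name on the left then refers to the whole matched value. `BinderDemo` and `describe` are made-up names for illustration.

```scala
object BinderDemo {
  // Hypothetical helper: classifies a list by how many elements it has.
  def describe(xs: List[Int]): String = xs match {
    // ys is bound to the entire list, but only if it has at least two elements
    case ys @ List(_, _, _*) => s"at least two elements, size ${ys.size}"
    case List(single)        => s"exactly one element: $single"
    case _                   => "empty"
  }
}
```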

which one of the 3 choices is more efficient?

Calling `dup.count` for each match seems inefficient, as `dup` then needs to be traversed again every time. I'd avoid that.

My guess would be that the guard (`if lengthCompare(1) > 0`) takes a few cycles less than the `List(_, _, _*)` pattern, but I haven't measured. And am not planning to.

Disclaimer: There may be a completely different (and more efficient) way of doing it that I can't think of right now. I'm only answering your specific questions.
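As one possible alternative (an unmeasured sketch, not a claim about what is fastest), the counts can be accumulated in a single `foldLeft` instead of materialising a list per key with `groupBy` or rescanning `dup` with `count`. The names `DupCounts` and `dupCounts` are hypothetical.

```scala
object DupCounts {
  // Hypothetical helper: counts occurrences in one pass, then keeps values seen more than once.
  def dupCounts(xs: List[Int]): List[(Int, Int)] = {
    // Single pass over xs: value -> number of occurrences
    val counts = xs.foldLeft(Map.empty[Int, Int]) { (acc, x) =>
      acc.updated(x, acc.getOrElse(x, 0) + 1)
    }
    // distinct keeps first-occurrence order, so the result order follows the input
    xs.distinct.collect { case x if counts(x) > 1 => (x, counts(x)) }
  }
}
```

This still makes a fixed number of linear passes (the fold, `distinct`, and `collect`), but it avoids one full traversal of the input per distinct element, and unlike the `groupBy` variants it returns the pairs in input order.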
