Fabio - 4 years ago
Scala Question

Scala collect function

Let's say I want to print the duplicates in a list along with their count. I have 3 options, as shown below:

def dups(dup: List[Int]) = {
  // 1)
  println(dup.groupBy(identity).collect { case (x, ys) if ys.lengthCompare(1) > 0 => (x, ys.size) }.toSeq)
  // 2)
  println(dup.groupBy(identity).collect { case (x, List(_, _, _*)) => x }.map(x => (x, dup.count(y => x == y))))
  // 3)
  println(dup.distinct.map((a: Int) => (a, dup.count((b: Int) => a == b))).filter((pair: (Int, Int)) => pair._2 > 1))
}


Questions:

-> For option 2, is there any way to name the list parameter so that it can be used to append the size of the list, just like I did in option 1 using ys.size?

-> For option 1, is there any way to avoid the last call to toSeq to return a List?

-> Which one of the 3 choices is more efficient, i.e. uses the least amount of loops?

As an example input: List(1,1,1,2,3,4,5,5,6,100,101,101,102)
Should print: List((1,3), (5,2), (101,2))

Based on @lutzh's answer below, the best way would be to do the following:

// breakOut is available in Scala 2.12 and earlier
import collection.breakOut
val list: List[(Int, Int)] = dup.groupBy(identity).collect({ case (x, ys @ List(_, _, _*)) => (x, ys.size) })(breakOut)
val list2: List[(Int, Int)] = dup.groupBy(identity).collect({ case (x, ys) if ys.lengthCompare(1) > 0 => (x, ys.size) })(breakOut)

Answer Source

For option 1 is there any way to avoid the last call to toSeq to return a List?

collect takes an implicit CanBuildFrom, so if you assign the result to a value of the desired type you can pass breakOut explicitly:

import collection.breakOut
val dups: List[(Int,Int)] = 
    dup
    .groupBy(identity)
    .collect({ case (x,ys) if ys.size > 1 => (x,ys.size)} )(breakOut)

collect will create a new collection (just like map), using a Builder. Usually the return type is determined by the origin type. With breakOut you basically ignore the origin type and look for a builder for the result type. So when collect creates the resulting collection, it will already create the "right" type, and you don't have to traverse the result again to convert it.
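breakOut and CanBuildFrom were removed in Scala 2.13, so the snippet above only compiles on 2.12 and earlier. As a rough sketch of how a similar "no intermediate collection" effect might be obtained on any version, the Map can be walked through its iterator and the List built from that. The object name DupsNoBreakOut is made up for illustration, and this is an alternative technique, not what the answer itself proposes.

```scala
object DupsNoBreakOut {
  // Returns each value that occurs more than once, paired with its count.
  def dups(dup: List[Int]): List[(Int, Int)] =
    dup
      .groupBy(identity)   // Map[Int, List[Int]]
      .iterator            // walk the entries lazily, no intermediate Map is built
      .collect { case (x, ys) if ys.lengthCompare(1) > 0 => (x, ys.size) }
      .toList              // materialise the result once, directly as a List
}
```

Note that groupBy returns a Map, whose iteration order is unspecified, so the pairs may not come out in the order of the original list. Sort the result (for example with sortBy(_._1)) if the order matters.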

For option 2, is there any way to name the list parameter so that it can be used to append the size of the list just like I did in option 1 using ys.size?

Yes, you can bind it to a variable with @:

val dups: List[(Int,Int)] = 
    dup
    .groupBy(identity)
    .collect({ case (x, ys @ List(_, _, _*)) => (x, ys.size) } )(breakOut)
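To see the @ binder in isolation, here is a minimal sketch on a plain match expression. The pattern List(_, _, _*) matches any list with at least two elements, and ys @ gives the whole matched list a name. BinderDemo and describe are hypothetical names used only for this illustration.

```scala
object BinderDemo {
  // Describes a group of equal values, as produced by groupBy(identity).
  def describe(xs: List[Int]): String = xs match {
    // ys refers to the entire list that matched the two-or-more pattern
    case ys @ List(_, _, _*) => s"duplicate, size ${ys.size}"
    // anything shorter (empty or a single element) falls through to here
    case _                   => "not a duplicate"
  }
}
```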

which one of the 3 choices is more efficient?

Calling dup.count for each match seems inefficient, because dup has to be traversed again every time. I'd avoid that.

My guess would be that the guard (if ys.lengthCompare(1) > 0) takes a few cycles less than the List(_, _, _*) pattern, but I haven't measured, and I'm not planning to.
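The repeated dup.count calls can be sidestepped by counting every element once up front and then filtering the counts. The following is a minimal sketch of that idea, not something taken from the answer; the names DupCounts and duplicates are made up for illustration, and it has not been benchmarked against the groupBy versions.

```scala
object DupCounts {
  // Returns each value that occurs more than once, paired with its count,
  // sorted by value so the output order is deterministic.
  def duplicates(dup: List[Int]): List[(Int, Int)] = {
    // One pass over dup: accumulate a running count per value.
    val counts: Map[Int, Int] =
      dup.foldLeft(Map.empty[Int, Int]) { (acc, x) =>
        acc.updated(x, acc.getOrElse(x, 0) + 1)
      }

    // One pass over the distinct values: keep only those seen more than once.
    counts.iterator
      .filter { case (_, n) => n > 1 }
      .toList
      .sortBy(_._1)
  }
}
```

With the example input from the question, DupCounts.duplicates(List(1,1,1,2,3,4,5,5,6,100,101,101,102)) yields List((1,3), (5,2), (101,2)).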

Disclaimer: There may be a completely different (and more efficient) way of doing it that I can't think of right now. I'm only answering your specific questions.
