Let's say I want to print
duplicates
count
3 options
def dups(dup:List[Int]) = {
//1)
println(dup.groupBy(identity).collect { case (x,ys) if ys.lengthCompare(1) > 0 => (x,ys.size) }.toSeq)
//2)
println(dup.groupBy(identity).collect { case (x, List(_, _, _*)) => x }.map(x => (x, dup.count(y => x == y))))
//3)
println(dup.distinct.map((a:Int) => (a, dup.count((b:Int) => a == b )) ).filter( (pair: (Int,Int) ) => { pair._2 > 1 } ))
}
option 2
option 1
option 1
more efficient by using the least amount of loops
val list: List[(Int, Int)] = dup.groupBy(identity).collect({ case (x, ys @ List(_, _, _*)) => (x, ys.size) })(breakOut)
val list2: List[(Int, Int)] = dup.groupBy(identity).collect { case (x, ys) if ys.lengthCompare(1) > 0 => (x, ys.size) }(breakOut)
For option 1 is there any way to avoid the last call to toSeq to return a List?
collect
takes a CanBuildFrom
, so if you assign it to something of the desired type you can use breakOut:
import collection.breakOut
val dups: List[(Int,Int)] =
dup
.groupBy(identity)
.collect({ case (x,ys) if ys.size > 1 => (x,ys.size)} )(breakOut)
collect
will create a new collection (just like map
), using a Builder
. Usually the return type is determined by the origin type. With breakOut you basically ignore the origin type and look for a builder for the result type. So when collect
creates the resulting collection, it will already create the "right" type, and you don't have to traverse the result again to convert it.
For option 2, is there any way to name the list parameter so that it can be used to append the size of the list just like I did in option 1 using ys.size?
Yes, you can bind it to a variable with @
val dups: List[(Int,Int)] =
dup
.groupBy(identity)
.collect({ case (x, ys @ List(_, _, _*)) => (x, ys.size) } )(breakOut)
which one of the 3 choices is more efficient?
Calling dup.count on a match seems inefficient, as dup needs to be traversed again then, I'd avoid that.
My guess would be that the guard (if lengthCompare(1) > 0) takes a few cycles less than the List(,,_*) pattern, but I haven't measured. And am not planning to.
Disclaimer: There may be a completely different (and more efficient) way of doing it that I can't think of right now. I'm only answering your specific questions.