mongolol - 1 month ago
Scala Question

Reducing a List of Case Classes to a Count of the Case Classes

I currently have a grouped RDD of the form ((id, code), (list of events with keys id and code)). Looking below, the ID is 000406106-01, the code is 496, and the individual events are each a Diagnostic case class. What I was hoping to do was obtain an RDD of the form ((id, code), count of events). Essentially, I wanted to collapse the CompactBuffer object of Diagnostic events into a count of the events. Any suggestions?

ID CODE EVENT1 EVENT2
((000406106-01,496),CompactBuffer(Diagnostic(000406106-01,Sun Apr 16 02:24:00 UTC 2006,496), Diagnostic(000406106-01,Fri Jul 20 15:30:00 UTC 2012,496), Diagnostic(000406106-01,Tue Dec 23 17:00:00 UTC 2014,496), Diagnostic(000406106-01,Wed Jan 06 20:45:00 UTC 2010,496), Diagnostic(000406106-01,Fri Mar 04 16:30:00 UTC 2011,496), Diagnostic(000406106-01,Sun Aug 04 04:51:00 UTC 2013,496), Diagnostic(000406106-01,Fri Mar 11 16:00:00 UTC 2011,496), Diagnostic(000406106-01,Tue Jul 10 13:45:00 UTC 2012,496), Diagnostic(000406106-01,Wed Jun 15 20:00:00 UTC 2005,496), Diagnostic(000406106-01,Tue Dec 29 13:30:00 UTC 2009,496), Diagnostic(000406106-01,Fri Jul 13 13:30:00 UTC 2012,496), Diagnostic(000406106-01,Thu Jul 26 03:40:00 UTC 2007,496), Diagnostic(000406106-01,Mon Jun 13 14:45:00 UTC 2005,496), Diagnostic(000406106-01,Wed Dec 24 18:00:00 UTC 2014,496), Diagnostic(000406106-01,Thu Mar 03 15:45:00 UTC 2011,496), Diagnostic(000406106-01,Wed Dec 31 15:00:00 UTC 2014,496), Diagnostic(000406106-01,Sat Jul 26 04:39:00 UTC 2008,496), Diagnostic(000406106-01,Thu Dec 31 20:30:00 UTC 2009,496)))


What I'm looking for:

ID CODE COUNT
((000406106-01,496), 18)


Edit: For clarity's sake, here's how the RDD above is being generated:

val grpDiag = diagnostic.groupBy(diag => (diag.id, diag.code))


Where diagnostic is an ungrouped RDD of the above data.

Answer

If the second element of the tuple is a CompactBuffer and all you need is its length, mapValues with _.size should give you the required result:

rdd.mapValues(_.size)
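
For context, here is a minimal end-to-end sketch of that approach, assuming spark-core is on the classpath and running in local mode. The Diagnostic definition is a stand-in: the field names id and code come from the groupBy call in the question, while the date field being a String and the second sample record are assumptions made purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Stand-in for the question's case class (the String date field is an assumption).
case class Diagnostic(id: String, date: String, code: String)

object GroupedCount {
  // Replaces each group of events with the number of events in it.
  def countGroups(
      grouped: RDD[((String, String), Iterable[Diagnostic])]
  ): RDD[((String, String), Int)] =
    grouped.mapValues(_.size)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("grouped-count").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val diagnostic = sc.parallelize(Seq(
        Diagnostic("000406106-01", "Sun Apr 16 02:24:00 UTC 2006", "496"),
        Diagnostic("000406106-01", "Fri Jul 20 15:30:00 UTC 2012", "496"),
        // Made-up record so that more than one key appears in the output.
        Diagnostic("000406106-02", "Tue Dec 23 17:00:00 UTC 2014", "250")
      ))

      // Same grouping as in the question.
      val grpDiag = diagnostic.groupBy(diag => (diag.id, diag.code))

      countGroups(grpDiag).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that mapValues leaves the keys untouched, so the (id, code) pairs carry through unchanged and only the buffer is replaced by its size.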

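In general, you should avoid grouping just to find a count, because groupBy shuffles every event across the cluster before anything is counted. Use reduceByKey instead, which combines partial counts on each partition before the shuffle: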

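import org.apache.spark.rdd.RDD
val diagnostics: RDD[Diagnostic] = ???  // the ungrouped RDD from the question
// Emit ((id, code), 1L) for every event, then sum the ones per key
diagnostics.map(d => ((d.id, d.code), 1L)).reduceByKey(_ + _)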
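
A runnable sketch of the reduceByKey variant might look like the following. It assumes spark-core on the classpath and local mode; the Diagnostic definition (in particular the String date field) and the second sample record are illustrative assumptions, not taken from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Stand-in for the question's case class (the String date field is an assumption).
case class Diagnostic(id: String, date: String, code: String)

object ReduceCount {
  // Counts events per (id, code) without materialising the groups first.
  def countByIdAndCode(diagnostics: RDD[Diagnostic]): RDD[((String, String), Long)] =
    diagnostics
      .map(d => ((d.id, d.code), 1L)) // one marker per event
      .reduceByKey(_ + _)             // sum the markers for each key

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("reduce-count").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val diagnostics = sc.parallelize(Seq(
        Diagnostic("000406106-01", "Sun Apr 16 02:24:00 UTC 2006", "496"),
        Diagnostic("000406106-01", "Fri Jul 20 15:30:00 UTC 2012", "496"),
        // Made-up record so that more than one key appears in the output.
        Diagnostic("000406106-02", "Tue Dec 23 17:00:00 UTC 2014", "250")
      ))

      countByIdAndCode(diagnostics).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The counts here are Long rather than Int because the per-event marker is 1L; that choice guards against overflow on very large keys and can be changed to Int if the result must match the mapValues(_.size) type.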