ScalaNewb ScalaNewb - 1 year ago 37
Scala Question

Sort by a key, but value has more than one element using Scala

I'm very new to Scala on Spark and wondering how you might create key value pairs, with the key having more than one element. For example, I have this dataset for baby names:

Year, Name, County, Number

2000, JOHN, KINGS, 50

2000, BOB, KINGS, 40

2000, MARY, NASSAU, 60

2001, JOHN, KINGS, 14

2001, JANE, KINGS, 30

2001, BOB, NASSAU, 45

And I want to find the most frequently occurring for each county, regardless of the year. How might I go about doing that?

I did accomplish this using a loop. Refer to below. But I'm wondering if there is shorter way to do this that utilizes Spark and Scala duality. (i.e. can I decrease computation time?)

val names = sc.textFile("names.csv").map(l => l.split(","))

val uniqueCounty = => x(2)).distinct.collect

for (i <- 0 to uniqueCounty.length-1) {
val county = uniqueCounty(i).toString;
val eachCounty = names.filter(x => x(2) == county).map(l => (l(1),l(4))).reduceByKey((a,b) => a + b).sortBy(-_._2);
println("County:" + county + eachCounty.first)

Answer Source

Here is the solution using RDD. I am assuming you need top occurring name per county.

val data = Array((2000, "JOHN", "KINGS", 50),(2000, "BOB", "KINGS", 40),(2000, "MARY", "NASSAU", 60),(2001, "JOHN", "KINGS", 14),(2001, "JANE", "KINGS", 30),(2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
//Reduce the uniq values for county/name as combo key
val uniqNamePerCountyRdd = => ((x._3,x._2),x._4)).reduceByKey(_+_)
// Group names per county.
val countyNameRdd =>(x._1._1,(x._1._2,x._2))).groupByKey()
// Sort and take the top name alone per county
countyNameRdd.mapValues(x => x.toList.sortBy(_._2).take(1)).collect


res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JANE,30))), (NASSAU,List((BOB,45))))