Drakax Drakax - 5 months ago 115
Scala Question

Scala RDD count by range

I need to "extract" some data contained in an Iterable[MyObject] (it was a RDD[MyObject] before a groupBy).

My initial RDD[MyObject] :

|-----------|---------|----------|
| startCity | endCity | Customer |
|-----------|---------|----------|
| Paris | London | ID | Age |
| | |----|-----|
| | | 1 | 1 |
| | |----|-----|
| | | 2 | 1 |
| | |----|-----|
| | | 3 | 50 |
|-----------|---------|----------|
| Paris | London | ID | Age |
| | |----|-----|
| | | 5 | 40 |
| | |----|-----|
| | | 6 | 41 |
| | |----|-----|
| | | 7 | 2 |
|-----------|---------|----|-----|
| New-York | Paris | ID | Age |
| | |----|-----|
| | | 9 | 15 |
| | |----|-----|
| | | 10| 16 |
| | |----|-----|
| | | 11| 46 |
|-----------|---------|----|-----|
| New-York | Paris | ID | Age |
| | |----|-----|
| | | 13| 7 |
| | |----|-----|
| | | 14| 9 |
| | |----|-----|
| | | 15| 60 |
|-----------|---------|----|-----|
| Barcelona | London | ID | Age |
| | |----|-----|
| | | 17| 66 |
| | |----|-----|
| | | 18| 53 |
| | |----|-----|
| | | 19| 11 |
|-----------|---------|----|-----|


I need to count them by age range by and groupBy startCity - endCity

The final result should be :

|-----------|---------|-------------|
| startCity | endCity | Customer |
|-----------|---------|-------------|
| Paris | London | Range| Count|
| | |------|------|
| | |0-2 | 3 |
| | |------|------|
| | |3-18 | 0 |
| | |------|------|
| | |19-99 | 3 |
|-----------|---------|-------------|
| New-York | Paris | Range| Count|
| | |------|------|
| | |0-2 | 0 |
| | |------|------|
| | |3-18 | 3 |
| | |------|------|
| | |19-99 | 2 |
|-----------|---------|-------------|
| Barcelona | London | Range| Count|
| | |------|------|
| | |0-2 | 0 |
| | |------|------|
| | |3-18 | 1 |
| | |------|------|
| | |19-99 | 2 |
|-----------|---------|-------------|


At the moment I'm doing this by count 3 times the same data (first time with 0-2 range, then 10-20, then 21-99).

Like :

Iterable[MyObject] ite

ite.count(x => x.age match {
case Some(age) => { age >= 0 && age < 2 }
}


It's working by giving me an Integer but not efficient at all I think since I have to count many times, what's the best way to do this please ?

Thanks

EDIT : The Customer object is a case class

Oli Oli
Answer Source
def computeRange(age : Int) = 
    if(age<=2)
        "0-2"
    else if(age<=10)
        "2-10"
    // etc, you get the idea

Then, with an RDD of case class MyObject(id : String, age : Int)

rdd
   .map(x=> computeRange(x.age) -> 1)
   .reduceByKey(_+_)

Edit: If you need to group by some columns, you can do it this way, provided that you have a RDD[(SomeColumns, Iterable[MyObject])]. The following lines would give you a map that associates each "range" to its number of occurences.

def computeMapOfOccurances(list : Iterable[MyObject]) : Map[String, Int] =
    list
        .map(_.age)
        .map(computeRange)
        .groupBy(x=>x)
        .mapValues(_.size)

val result1 = rdd
    .mapValues( computeMapOfOccurances(_))

And if you need to flatten your data, you can write:

val result2 = result1
    .flatMapValues(_.toSeq)    
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download