Thabby07 Thabby07 - 2 months ago 23
Scala Question

Calculating a variable inside RDD after full outer join in Scala

What I want to do is simpl, but I struggle with Scala and RDDs.
The concept is this:

rdd1 rdd2

id count id count
a 2 a 1
b 1 c 5
d 3


And the result I am searching for is this:

rdd2

id count
a 3
b 1
c 5
d 3


what I intend to do is to perform a full outer join to get common and non common registers, identified by the id field. For now, rdd2, is empty.
rdd1 and rdd2 are:

RDD[(String, org.apache.spark.sql.Row)]


For now, I have the following code:

var rdd3 = rdd1.fullOuterJoin(rdd2).map {
case (id, left, right) =>
// TODO
}


How can I calculate that sum between RDDs?

Answer

If you are doing a fullOuterJoin you get the key and two Options passed into the closure (one Option represents the left side, the other one the right side). So the closure could look like this:

val result = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    (id, left.getOrElse(0) + right.getOrElse(0))
}

This applies if your RDD is of type (String, Int).