Thabby07 Thabby07 - 3 months ago 41
Scala Question

Calculating a variable inside RDD after full outer join in Scala

What I want to do is simpl, but I struggle with Scala and RDDs.
The concept is this:

rdd1 rdd2

id count id count
a 2 a 1
b 1 c 5
d 3

And the result I am searching for is this:


id count
a 3
b 1
c 5
d 3

what I intend to do is to perform a full outer join to get common and non common registers, identified by the id field. For now, rdd2, is empty.
rdd1 and rdd2 are:

RDD[(String, org.apache.spark.sql.Row)]

For now, I have the following code:

var rdd3 = rdd1.fullOuterJoin(rdd2).map {
case (id, left, right) =>

How can I calculate that sum between RDDs?


If you are doing a fullOuterJoin you get the key and two Options passed into the closure (one Option represents the left side, the other one the right side). So the closure could look like this:

val result = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    (id, left.getOrElse(0) + right.getOrElse(0))

This applies if your RDD is of type (String, Int).