mongolol mongolol - 1 month ago 12
Scala Question

Funcitonal Opposite of Subtract by Key

I have two RDDs of the form RDD1[K, V1] and RDD2[K, V2]. I was hoping to remove values in RDD2 which are not in RDD1. (Essentially an inner join on each of the RDD's keys, but I don't want to copy over RDD1's values.)

I understand that there's a method

subtractByKey
which performs the opposite of this. (Keeps those that are distinct.)

Answer

You cannot avoid having some type of value here so applying join and mapping values seems to be the way to go. You can use:

rdd2.join(rdd1.mapValues(_ => None)).mapValues(_._1)

which replaces values with dummies (usually you can skip that because there is not much to gain here unless values are largish):

_.mapValues(_ => None)

joins, and drops placeholders:

_.mapValues(_._1)