
Apache Spark - Intersection of Multiple RDDs

In Apache Spark one can union multiple RDDs efficiently by using the

sparkContext.union()

method. Is there something similar for intersecting multiple RDDs? I have searched the SparkContext methods and could not find anything there or anywhere else. One solution could be to union the RDDs and then retrieve the duplicates, but I do not think it would be that efficient. Assume I have the following example with key/value pair collections:

val rdd1 = sc.parallelize(Seq((1,2.0),(1,1.0),(2,1.0)))
val rdd2 = sc.parallelize(Seq((1,2.0),(3,4.0),(3,1.0)))


I want to retrieve a new collection that contains the pairs whose key appears in every RDD (here only key 1):

(1,2.0), (1,1.0), (1,2.0)


But of course for multiple RDDs, not just two.
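Note that for just two RDDs there is the element-wise

rdd1.intersection(rdd2)

method, but it deduplicates its output, so on the data above it would return only (1,2.0) rather than the key-based result I am after, and it does not generalize to a whole sequence of RDDs.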

Answer

Try:

val rdds = Seq(
  sc.parallelize(Seq(1, 3, 5)),
  sc.parallelize(Seq(3, 5)),
  sc.parallelize(Seq(1, 3))
)

// Pair every element with a dummy value so join can be used, then fold
// over the sequence: each join keeps only the keys present on both sides.
rdds
  .map(rdd => rdd.map(x => (x, None)))
  .reduce((x, y) => x.join(y).keys.map(x => (x, None)))
  .keys
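For the sample data this yields the single common element 3. Because each join keeps only matching keys, the data shrinks at every step instead of growing the way a union-then-find-duplicates approach would.

The same idea carries over to the key/value case from the question. Below is a minimal sketch, not a built-in Spark method; the names pairRdds, commonKeys and result are illustrative. It first intersects the key sets of all RDDs with the same join trick, then keeps every original pair whose key survives:

val pairRdds = Seq(rdd1, rdd2)

// Common keys across all RDDs, each paired with a dummy value for joining.
val commonKeys = pairRdds
  .map(rdd => rdd.keys.distinct.map(k => (k, None)))
  .reduce((x, y) => x.join(y).keys.map(k => (k, None)))

// From every RDD, keep the pairs whose key is common, and union the results.
val result = sc.union(pairRdds.map(rdd => rdd.join(commonKeys).mapValues(_._1)))

For rdd1 and rdd2 above, result contains (1,2.0), (1,1.0) and (1,2.0), as requested.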