Sonex - 7 months ago
Scala Question

Apache Spark - Intersection of Multiple RDDs

In Apache Spark one can union multiple RDDs efficiently by using the SparkContext.union method. Is there something similar for intersecting multiple RDDs? I have searched the SparkContext methods, and elsewhere, but could not find anything. One solution could be to union the RDDs and then retrieve the duplicates, but I do not think that would be efficient. Assume I have the following example with key/value pair collections:

val rdd1 = sc.parallelize(Seq((1,2.0),(1,1.0),(2,1.0)))
val rdd2 = sc.parallelize(Seq((1,2.0),(3,4.0),(3,1.0)))

I want to retrieve a new collection which has the following elements:

(1,2.0) (1,1.0) (1,2.0)

But of course for multiple rdds and not just two.
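For clarity, here is what that desired result looks like on plain Scala collections: keep every pair (duplicates included) whose key occurs in all of the inputs. The helper name `keysInAll` is hypothetical, just a sketch of the semantics; on RDDs the same effect would come from key-based joins rather than in-memory sets.

```scala
// Hypothetical helper: key-based "intersection" of pair collections.
// A pair survives if its key appears in every input collection.
def keysInAll[K, V](colls: Seq[Seq[(K, V)]]): Seq[(K, V)] = {
  // keys present in every collection
  val common = colls.map(_.map(_._1).toSet).reduce(_ intersect _)
  // keep every pair (with duplicates) whose key is common
  colls.flatten.filter { case (k, _) => common(k) }
}

val c1 = Seq((1, 2.0), (1, 1.0), (2, 1.0))
val c2 = Seq((1, 2.0), (3, 4.0), (3, 1.0))
keysInAll(Seq(c1, c2))  // Seq((1,2.0), (1,1.0), (1,2.0))
```

This reproduces the expected output above: only key 1 is common to both inputs, so its three pairs survive.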



val rdds = Seq(
  sc.parallelize(Seq(1, 3, 5)),
  sc.parallelize(Seq(3, 5)),
  sc.parallelize(Seq(1, 3))
)

rdds
  .map(_.map(x => (x, None)))
  .reduce((x, y) => x.join(y).keys.map(x => (x, None)))
  .keys