progNewbie - 2 months ago
Java Question

How to get element by Index in Spark RDD (Java)

I know the method rdd.first() which gives me the first element in an RDD.

There is also the method rdd.take(num), which gives me the first "num" elements.

But is there a way to get an element by index?



This should be possible by first indexing the RDD. The transformation zipWithIndex provides a stable indexing, numbering each element in its original order.

Given: rdd = (a,b,c)

val withIndex = rdd.zipWithIndex // ((a,0),(b,1),(c,2))

This form is not useful for looking up an element by index. First we need to use the index as the key:

val indexKey = withIndex.map { case (k, v) => (v, k) } // ((0,a),(1,b),(2,c))

Now, it's possible to use the lookup action in PairRDD to find an element by key:

val b = indexKey.lookup(1L) // Seq(b)

If you expect to use lookup often on the same RDD, I'd recommend caching the indexKey RDD to improve performance.

How to do this using the Java API is an exercise left for the reader.
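
For readers who would rather skip the exercise, here is a rough Java sketch of the same three steps (zipWithIndex, swap to make the index the key, lookup). It assumes Spark's Java API (spark-core) is on the classpath and that Java 8 lambdas are available. The class name IndexLookupSketch, the helper indexByPosition and the local[*] master are made up for illustration, not part of Spark or of the answer above.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IndexLookupSketch {

    // Hypothetical helper: returns an RDD keyed by position, e.g. (0,a), (1,b), (2,c).
    static <T> JavaPairRDD<Long, T> indexByPosition(JavaRDD<T> rdd) {
        // zipWithIndex yields (element, index); swap so the index becomes the key.
        return rdd.zipWithIndex().mapToPair(pair -> pair.swap());
    }

    public static void main(String[] args) {
        // Assumption: a local master, just to try the idea out.
        SparkConf conf = new SparkConf()
                .setAppName("IndexLookupSketch")
                .setMaster("local[*]");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c"));

            // Cache the keyed RDD because it will be queried more than once.
            JavaPairRDD<Long, String> indexKey = indexByPosition(rdd).cache();

            // lookup returns every value stored under the key; here a list containing "b".
            List<String> atOne = indexKey.lookup(1L);
            System.out.println(atOne);

            // An index past the end yields an empty list rather than an exception.
            System.out.println(indexKey.lookup(10L).isEmpty());
        }
    }
}
```

Note that lookup hands back a List rather than a single element, so the caller still has to check for an empty result before calling get(0).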