user1189851 - 1 month ago
Scala Question

Add a column with a rank to an rdd in Spark Scala

Unfortunately, we still have to use Spark 1.0.0 and need to work with RDDs.
I have an RDD that is created from a CSV file.

val serialRDD = sc.textFile(path)


If we print each line of the RDD, we get something like this: an id and a string.

1929 abc
2384 def
8753 ghi
3893 jkl


I want to add another column containing a second id: a string like "SERIAL-RANK", where RANK is 1, 2, 3, etc., auto-incrementing by 1.

The output should be like:

1929 abc SERIAL-1
2384 def SERIAL-2
8753 ghi SERIAL-3
3893 jkl SERIAL-4


How can I get this done using RDDs? Thanks a lot in advance for the help.

Answer

You can use zipWithIndex and map to get it done:

// serialRDD is an RDD[String], so each element r is a whole line
serialRDD.zipWithIndex.map { case (r, i) => s"$r SERIAL-${i + 1}" }

I used string interpolation to build the SERIAL-X string. I also added 1 to the index because zipWithIndex starts counting at 0.
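To try the idea without a cluster, here is a minimal sketch that uses a plain Scala Seq as a stand-in for the RDD. The object name SerialColumnSketch and the helpers addSerial and addSerialColumns are hypothetical names for illustration only. It assumes each line holds an id and a string separated by whitespace, as in the printed sample. On an RDD[String] the same map body should apply; the only difference I am aware of is that RDD.zipWithIndex yields Long indices while Seq.zipWithIndex yields Int.

```scala
object SerialColumnSketch {
  // Hypothetical helper: append "SERIAL-<n>" to every line, numbering from 1.
  // zipWithIndex pairs each element with its 0-based position, hence i + 1.
  def addSerial(lines: Seq[String]): Seq[String] =
    lines.zipWithIndex.map { case (line, i) => s"$line SERIAL-${i + 1}" }

  // Hypothetical variant: split each line into (id, value) first and
  // return a three-field tuple instead of a single string.
  // Assumes exactly two whitespace-separated fields; a line with only
  // one field would fail the Array(id, value) pattern with a MatchError.
  def addSerialColumns(lines: Seq[String]): Seq[(String, String, String)] =
    lines.zipWithIndex.map { case (line, i) =>
      val Array(id, value) = line.trim.split("\\s+", 2)
      (id, value, s"SERIAL-${i + 1}")
    }

  def main(args: Array[String]): Unit = {
    // Sample rows taken from the question
    val sample = Seq("1929 abc", "2384 def", "8753 ghi", "3893 jkl")
    addSerial(sample).foreach(println)
  }
}
```

The tuple variant may be more convenient if later steps need the id and the string as separate fields; the string variant reproduces the output format asked for in the question.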
