zizouraj - 1 month ago
Scala Question

How do I rewrite the code below so that I get the expected output?

The objective is to read a list of known files from Amazon S3 and create a single file in S3 at some output path. Each file is tab-separated. I have to extract the first element from each line and assign it a numeric value in increasing order. The numeric value and the element should be tab-separated in the new file that gets created. I am using Spark with Scala to perform operations on RDDs.

Expected output

1 qwerty
2 asdf
...
...
67892341 ghjk

Current output

1 qwerty
2 asdf
...
...
456721 tyui
1 sdfg
2 swerr
...
...
263523 gkfk
...
...
512346 ghjk

So, basically, because the computation happens on a distributed cluster, the global variable counter is initialized separately on each machine (each executor gets its own copy of the closure). How can I rewrite the code so that I get the desired output? Below is the code snippet.

def getReqCol() = {
  val myRDD = sc.textFile("s3://mybucket/fileFormatregex")
  var counter = 0
  val mbLuidCol = myRDD
    .map(x => x.split("\t"))
    .map(col => col(0))
    .map(row => {
      // counter is captured in the closure, so each executor
      // mutates its own deserialized copy, not the driver's
      def inc(acc: Int) = {
        counter = acc + 1
      }
      inc(counter)
      counter + "\t" + row
    })
  mbLuidCol.repartition(1).saveAsTextFile("s3://mybucket/outputPath")
}
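
The restarting counter can be reproduced without Spark. The sketch below (plain Scala, hypothetical data) simulates each partition receiving its own copy of the captured var, which is what happens when Spark serializes the closure to each executor:

```scala
object CounterSketch {
  // Two "partitions" of toy data, standing in for the S3 files.
  val partitions = Seq(Seq("a", "b", "c"), Seq("d", "e"))

  val results = partitions.map { part =>
    var counter = 0 // fresh copy per partition, like on each executor
    part.map { row =>
      counter += 1
      s"$counter\t$row"
    }
  }

  def main(args: Array[String]): Unit =
    // numbering restarts at 1 for the second partition: 1 a, 2 b, 3 c, 1 d, 2 e
    results.flatten.foreach(println)
}
```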

Answer

Looks like all you need is RDD.zipWithIndex():

sc
  .textFile("s3://mybucket/fileFormatregex")
  .map(_.split("\t")(0))                           // keep only the first column
  .zipWithIndex()                                  // pairs each value with a 0-based Long index
  .map(_.swap)
  .sortByKey(ascending = true, numPartitions = 1)  // single, globally ordered partition
  .map { case (idx, value) => (idx + 1) + "\t" + value }  // 1-based, tab-separated lines
  .saveAsTextFile("s3://mybucket/outputPath")
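
For intuition, Scala collections have the same zipWithIndex operation, so the numbering step can be sketched locally (values here are just the ones from the expected output):

```scala
object ZipSketch {
  // Local-collection analogue of RDD.zipWithIndex: pair each element
  // with its index, then format as "1-based-index<TAB>value".
  val cols = Seq("qwerty", "asdf", "ghjk")
  val numbered = cols.zipWithIndex.map { case (value, idx) => s"${idx + 1}\t$value" }

  def main(args: Array[String]): Unit =
    numbered.foreach(println) // 1 qwerty / 2 asdf / 3 ghjk, tab-separated
}
```

On an RDD the index is assigned per partition and stitched together in partition order, so unlike the mutable counter it is globally consistent across executors.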