Ninja Ninja - 2 months ago 7
Scala Question

Key/Value pair RDD in Spark

I have a question about key/value pair RDDs.

I have five files in the C:/download/input folder. Each file contains the dialog of one film, and the files are named as follows:

movie_horror_Conjuring.txt
movie_comedy_eurotrip.txt
movie_horror_insidious.txt
movie_sci-fi_Interstellar.txt
movie_horror_evildead.txt


I am reading the files in the input folder using sc.wholeTextFiles(), which gives me key/value pairs like this:

(C:/download/input/movie_horror_Conjuring.txt,values)


I want to group the input files of each genre together using groupByKey(), so that the values of all the horror movies end up together, all the comedy movies together, and so on.

Is there any way I can generate the key/value pair as (horror, values) instead of (C:/download/input/movie_horror_Conjuring.txt,values)?


val ipfile = sc.wholeTextFiles("C:/download/input")
val output = ipfile.groupByKey().map(t => (t._1,t._2))


The above code gives me the following output:

(C:/download/input/movie_horror_Conjuring.txt,values)
(C:/download/input/movie_comedy_eurotrip.txt,values)
(C:/download/input/movie_horror_insidious.txt,values)
(C:/download/input/movie_sci-fi_Interstellar.txt,values)
(C:/download/input/movie_horror_evildead.txt,values)


whereas I need the output to be:

(horror, (values1, values2, values3))
(comedy, (values1))
(sci-fi, (values1))


I also tried some map and split operations to strip the folder path from the key and keep only the file name, but I was not able to keep the corresponding values attached to the new keys.

avr avr
Answer

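Map each key to its genre before grouping. For your file names, splitting the path on "_" and taking the second element gives the genre. Try this: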

 val output = ipfile.map{case (k, v) => (k.split("_")(1),v)}.groupByKey()    
 output.collect
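
Note that splitting the full path on "_" assumes no directory in the path contains an underscore. If that might not hold, a slightly more defensive sketch is to strip the directories first and only then split the file name. The helper name genreOf and the sample paths below are my own illustration, not part of your code, and the sketch assumes every file follows the movie_<genre>_<title>.txt pattern:

```scala
// Hypothetical helper: extract the genre from a path such as
// "C:/download/input/movie_horror_Conjuring.txt".
// Assumes the file name has the form movie_<genre>_<title>.txt;
// a name without an underscore would throw an ArrayIndexOutOfBoundsException.
def genreOf(path: String): String = {
  val fileName = path.split("[/\\\\]").last // drop directories ('/' or '\')
  fileName.split("_")(1)                    // second token is the genre
}

// Plain-collection demonstration using the file names from the question.
val paths = Seq(
  "C:/download/input/movie_horror_Conjuring.txt",
  "C:/download/input/movie_comedy_eurotrip.txt",
  "C:/download/input/movie_horror_insidious.txt",
  "C:/download/input/movie_sci-fi_Interstellar.txt",
  "C:/download/input/movie_horror_evildead.txt"
)

// Group the paths by genre, the same keying the RDD version would use.
val byGenre: Map[String, Seq[String]] = paths.groupBy(genreOf)
```

With Spark, the same helper would slot into the map step: ipfile.map { case (path, text) => (genreOf(path), text) }.groupByKey()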

Let me know if this works for you!