Ninja Ninja - 5 months ago 24
Scala Question

Key/Value pair RDD in Spark

I have a question on key/value pair RDDs.

I have five files in the input folder; each file contains film dialogs as its content, as follows:


I am trying to read the files in the input folder using sc.wholeTextFiles(), where I get the key/value pairs as follows:


I am trying to do an operation where I have to group the input files of each genre together using groupByKey(): the values of all the horror movies together, all the comedy movies together, and so on.

Is there any way I can generate the key/value pair as
(horror, values)
instead of (file path, values)?

val ipfile = sc.wholeTextFiles("C:/download/input")
val output = ipfile.groupByKey().map(t => (t._1,t._2))

The above code gives me the output as follows:


whereas I need the output as follows:

(horror, (values1, values2, values3))
(comedy, (values1))
(sci-fi, (values1))

I also tried some map and split operations to remove the folder path from the key and keep only the file name, but I'm not able to attach the corresponding values back to the file names.
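The re-keying logic being asked about can be sketched without Spark at all: the sketch below assumes hypothetical file names of the form `horror_values1.txt` (a `genre_name` naming convention, which is an assumption, not stated in the question), strips the folder path, and uses plain Scala's `groupBy` to mimic what `groupByKey` would do on the RDD.

```scala
// Simulated output of sc.wholeTextFiles: (full path, file contents).
// The "genre_name.txt" file-name format below is an assumption for illustration.
val pairs = Seq(
  ("C:/download/input/horror_values1.txt", "dialog 1"),
  ("C:/download/input/horror_values2.txt", "dialog 2"),
  ("C:/download/input/comedy_values1.txt", "dialog 3")
)

// Strip the directory path, then take the text before the first "_"
// as the genre key.
def genreOf(path: String): String =
  path.split("/").last.split("_")(0)

// groupBy on a plain collection mirrors groupByKey on a pair RDD.
val grouped: Map[String, Seq[String]] =
  pairs.groupBy { case (path, _) => genreOf(path) }
       .map { case (genre, kvs) => (genre, kvs.map(_._2)) }

println(grouped)
```

Once the path-to-genre extraction works on plain strings like this, the same function can be dropped into an RDD `map` before `groupByKey`.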

avr avr

Try this:

 val output = ipfile
   .map { case (k, v) => (k.split("_")(1), v) }  // re-key by the segment after the first "_"
   .groupByKey()

Let me know if this works for you!
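Putting the pieces together, a minimal self-contained sketch might look like the following. It assumes file names of the form `horror_values1.txt` (a hypothetical `genre_name` convention; adjust the splits to your actual naming), strips the folder path first, and then groups by the genre prefix.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupByGenre {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("GroupByGenre").setMaster("local[*]"))

    // (full path, whole file contents) pairs, one pair per file.
    val ipfile = sc.wholeTextFiles("C:/download/input")

    // Re-key each file by genre before grouping. The "genre_name.txt"
    // file-name format is an assumption; adapt the splits to your files.
    val output = ipfile
      .map { case (path, contents) =>
        val fileName = path.split("/").last   // drop the folder path
        (fileName.split("_")(0), contents)    // genre prefix as the key
      }
      .groupByKey()                           // (genre, Iterable[contents])

    output.collect().foreach { case (genre, values) =>
      println(s"($genre, ${values.mkString("(", ", ", ")")})")
    }

    sc.stop()
  }
}
```

Note that if the downstream step is an aggregation rather than collecting the raw texts, `reduceByKey` or `aggregateByKey` is usually preferred over `groupByKey`, since it combines values on each partition before shuffling.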