maddie maddie - 1 month ago 6
Scala Question

Spark - Scala : Return multiple <key, value> after processing one line

I have a dataset that looks like below -

0 -- 1,2,4

1 -- 0,4

2 -- 0,4

4 -- 2,1,0


I want to read each line and transform it to something that looks like below


// for the line 0 -- 1,2,4

(0,1) <2,4>

(0,2) <1,4>

(0,4) <1,2>


// for the line 1 -- 0,4

(0,1) <4>

(1,4) <0>


// the smaller number always appears first in the pair

That is, I split each line on the "--" delimiter, so the first line of the dataset gives me 0 and 1,2,4. After that, I want to create pairs. For example, (0,1) will be a key in the transformed map, and its value should be 2,4.

Once this is done, I want to be able to group values by key

For example (0,1) <2,4> <4>

and intersect them to get 4.

Is it possible to do something like this? Is my approach right?

I have written the below code so far-

var mapOperation = logData.map(x=>x.split("\t")).filter(x => x.length == 2).map(x => (x(0),x(1)))
// reading file and creating the map Example - key 0 value 1,2,4

//from the first map, trying to create pairs
var mapAgainstValue = mapOperation.map{
line =>
val fromFriend = line._1
val toFriendsList = line._2.split(",")
(fromFriend -> toFriendsList)
}

val newMap = mapAgainstValue.map{
line =>
var key ="";
for(userIds <- line._2){
key =line._1.concat(","+userIds);
(key -> line._2.toList)
}

}


The problem is I am not able to call groupByKey on newMap. I am assuming there is some issue with the way I have created the map?

Appreciate any help.

Thanks.

Answer
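
A likely reason groupByKey is not available on newMap: in Scala, a for loop without yield is evaluated only for its side effects and returns Unit, so the map in the question most probably produces an RDD[Unit] rather than an RDD of (key, value) pairs, and groupByKey is only offered on pair RDDs. The sketch below shows the difference on a plain local tuple, with no Spark involved. The names ForYieldSketch, withoutYield and withYield are made up for illustration.

```scala
object ForYieldSketch {
  // One parsed line: user "0" with friends 1, 2, 4 (same shape as a row of mapAgainstValue).
  val line: (String, Array[String]) = ("0", Array("1", "2", "4"))

  // Without `yield` the loop body's tuple is discarded, so the whole expression is Unit.
  val withoutYield: Unit = for (userId <- line._2) {
    (line._1 + "," + userId, line._2.toList)
  }

  // With `yield` the loop returns one (key, value) pair per friend.
  val withYield: List[(String, List[String])] =
    (for (userId <- line._2) yield (line._1 + "," + userId, line._2.toList)).toList

  def main(args: Array[String]): Unit = {
    println(withoutYield)      // prints ()
    withYield.foreach(println) // one pair per friend
  }
}
```

Even with yield, each input line produces several pairs, so on an RDD the outer call would need to be flatMap rather than map to end up with a flat collection of pairs.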

Your problem can be solved like this :

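val inputRDD = sc.textFile("inputFile.txt")
inputRDD.flatMap { a =>
          // "0 -- 1,2,4" => firstTerm = 0, secondTermAsList = Array(1, 2, 4)
          val list = a.split("--").map(_.trim)
          val firstTerm = list(0).toInt
          val secondTermAsList = list(1).split(",").map(_.trim.toInt)
          secondTermAsList.map { b =>
          // compare as numbers so the smaller id always comes first in the key
          val key = if (b > firstTerm) (firstTerm, b) else (b, firstTerm)
          // the value is every other friend listed on this line
          val value = secondTermAsList diff List(b)
          (key, value)
          }
          }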

This code results in this output :

+-----+------+
|_1   |_2    |
+-----+------+
|[0,1]|[2, 4]|
|[0,2]|[1, 4]|
|[0,4]|[1, 2]|
|[0,1]|[4]   |
|[1,4]|[0]   |
|[0,2]|[4]   |
|[2,4]|[0]   |
|[2,4]|[1, 0]|
|[1,4]|[2, 0]|
|[0,4]|[2, 1]|
+-----+------+
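
The question also asks for grouping by key and intersecting the grouped lists. Here is a minimal sketch of that last step, written against plain Scala collections so it runs without a Spark context. It repeats the pairing step on a local List, then groups and intersects. The names CommonFriendsSketch, pairs and common are made up for illustration. On a pair RDD, the assumed equivalent would be reduceByKey(_ intersect _), or groupByKey followed by reducing each group.

```scala
object CommonFriendsSketch {
  // Sample rows copied from the question.
  val lines: List[String] = List("0 -- 1,2,4", "1 -- 0,4", "2 -- 0,4", "4 -- 2,1,0")

  // Pairing step on a local List instead of an RDD: one (pair, other friends) entry per friend.
  val pairs: List[((Int, Int), List[Int])] = lines.flatMap { line =>
    val parts = line.split("--").map(_.trim)
    val user = parts(0).toInt
    val friends = parts(1).split(",").map(_.trim.toInt).toList
    friends.map { f =>
      // Smaller id first in the key; the value is every friend except f.
      ((math.min(user, f), math.max(user, f)), friends.filter(_ != f))
    }
  }

  // Group the lists by key, then intersect the lists inside each group.
  val common: Map[(Int, Int), List[Int]] =
    pairs.groupBy(_._1).map { case (key, group) =>
      key -> group.map(_._2).reduce(_ intersect _)
    }

  def main(args: Array[String]): Unit =
    common.toList.sortBy(_._1).foreach(println)
}
```

For the sample data, the key (0,1) ends up with List(4), which matches the result described in the question.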

I hope this solves your issue !
