Simon Kim - 1 year ago
Scala Question

Transform an RDD with a list column into multiple rows in Spark

Hi, I have an RDD like this (with case class userInfo(userID: Long, day: String, prodIDList: String)):


userA, 2016-10-12, [10000, 100001]

userB, 2016-10-13, [9999, 1003]

userC, 2016-10-13, [8888, 1003,2000]

And I want to transform it into this:


userA, 2016-10-12, 10000

userA, 2016-10-12, 100001

userB, 2016-10-13, 9999

userB, 2016-10-13, 1003

userC, 2016-10-13, 8888

userC, 2016-10-13, 1003

userC, 2016-10-13, 2000

Does anyone have an idea how I can do this using RDD operations in Spark?

The related Stack Overflow post "Spark RDD mapping one row of data into multiple rows" suggests using flatMap, but I don't know how to apply it to my case because I am a Spark beginner.

Thanks in advance.

Answer Source

Try this:

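// Sample data: (user, day, product IDs)
val data = sc.parallelize(Array(("userA", "2016-10-12", Array(10000, 100001)),
             ("userB", "2016-10-13", Array(9999, 1003)),
             ("userC", "2016-10-13", Array(8888, 1003, 2000))))
// Key each row by (user, day), emit one pair per product ID, then unpack the key again
val resultRDD = data.map { case (a, b, c) => ((a, b), c)
}.flatMapValues(x => x).map { case ((a, b), c) => (a, b, c) }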
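
If prodIDList really is a String, as in the case class from the question, the list has to be parsed before it can be flattened. Below is a minimal sketch of that step. It assumes the string looks like "[10000, 100001]" and that userID is a String to match the sample rows; the names UserInfo, parseIds and explode are made up for illustration. It runs on a plain Scala Seq so it can be tried without a cluster.

```scala
// Hypothetical record type; userID is a String here to match the sample rows.
case class UserInfo(userID: String, day: String, prodIDList: String)

// Parse a string such as "[8888, 1003,2000]" into individual product IDs.
// Assumes square brackets around comma-separated integers.
def parseIds(raw: String): Seq[Long] =
  raw.trim.stripPrefix("[").stripSuffix("]")
    .split(",")
    .map(_.trim)
    .filter(_.nonEmpty)
    .map(_.toLong)
    .toSeq

// Expand one record into one (userID, day, prodID) row per product ID.
def explode(u: UserInfo): Seq[(String, String, Long)] =
  parseIds(u.prodIDList).map(id => (u.userID, u.day, id))

val rows = Seq(
  UserInfo("userA", "2016-10-12", "[10000, 100001]"),
  UserInfo("userB", "2016-10-13", "[9999, 1003]"),
  UserInfo("userC", "2016-10-13", "[8888, 1003,2000]")
)

// flatMap concatenates the per-record sequences into a single sequence of rows.
val exploded = rows.flatMap(explode)
```

On an RDD[UserInfo] the call should have the same shape, rdd.flatMap(explode), since RDD.flatMap also takes a function that returns a collection for each element.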