StrongYoung - 1 month ago
Scala Question

How to output multiple (key, value) pairs in a Spark map function

The input data is formatted as follows:

+--------------------+-------------+--------------------+
| StudentID| Right | Wrong |
+--------------------+-------------+--------------------+
| studentNo01 | a,b,c | x,y,z |
+--------------------+-------------+--------------------+
| studentNo02 | c,d | v,w |
+--------------------+-------------+--------------------+


The desired output looks like this:

+--------------------+---------+
| key | value|
+--------------------+---------+
| studentNo01,a | 1 |
+--------------------+---------+
| studentNo01,b | 1 |
+--------------------+---------+
| studentNo01,c | 1 |
+--------------------+---------+
| studentNo01,x | 0 |
+--------------------+---------+
| studentNo01,y | 0 |
+--------------------+---------+
| studentNo01,z | 0 |
+--------------------+---------+
| studentNo02,c | 1 |
+--------------------+---------+
| studentNo02,d | 1 |
+--------------------+---------+
| studentNo02,v | 0 |
+--------------------+---------+
| studentNo02,w | 0 |
+--------------------+---------+


Each entry in the Right column maps to 1, and each entry in the Wrong column maps to 0.

I want to process this data with a Spark map function or a UDF, but I don't know how to go about it. Can you help me, please? Thank you.

Answer

Use split and explode once for each column (Right and Wrong), tag the rows with 1 or 0, and union the two results:

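import org.apache.spark.sql.functions.{concat_ws, explode, lit, split}

// toDF and the 'column syntax need the session implicits (already in scope in spark-shell)
val df = List(
  ("studentNo01","a,b,c","x,y,z"),
  ("studentNo02","c,d","v,w")
).toDF("StudentID","Right","Wrong")

+-----------+-----+-----+
|  StudentID|Right|Wrong|
+-----------+-----+-----+
|studentNo01|a,b,c|x,y,z|
|studentNo02|  c,d|  v,w|
+-----------+-----+-----+


// explode yields one row per answer in a column named "col"
val pair = (
  df.select('StudentID,explode(split('Right,",")))
    .select(concat_ws(",",'StudentID,'col).as("key"))
    .withColumn("value",lit(1))   // Right answers get 1
).unionAll(                       // on Spark 2.0+ `union` is the equivalent method
  df.select('StudentID,explode(split('Wrong,",")))
    .select(concat_ws(",",'StudentID,'col).as("key"))
    .withColumn("value",lit(0))   // Wrong answers get 0
)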


+-------------+-----+
|          key|value|
+-------------+-----+
|studentNo01,a|    1|
|studentNo01,b|    1|
|studentNo01,c|    1|
|studentNo02,c|    1|
|studentNo02,d|    1|
|studentNo01,x|    0|
|studentNo01,y|    0|
|studentNo01,z|    0|
|studentNo02,v|    0|
|studentNo02,w|    0|
+-------------+-----+

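You can then convert the result to an RDD of (key, value) pairs:

// Going through .rdd works on Spark 1.x and 2.x; on 2.x, pair.map alone returns a Dataset
val rdd = pair.rdd.map(r => (r.getString(0), r.getInt(1)))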
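
Since the question asks about a map function: a plain map emits exactly one output per input row, so the usual tool for emitting several pairs per row is flatMap. Below is a minimal sketch of that idea on an ordinary Scala list, so it runs without a Spark session. The helper name toPairs and the choice to trim whitespace and skip empty fields are my own assumptions, not part of the answer above.

```scala
// Hypothetical helper (not from the answer above): expands one input row
// into one (key, value) pair per comma-separated entry.
def toPairs(studentId: String, right: String, wrong: String): Seq[(String, Int)] = {
  // Split a comma-separated field; blanks are dropped so an empty field yields no pairs.
  def items(field: String): Seq[String] =
    field.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  items(right).map(a => (s"$studentId,$a", 1)) ++
    items(wrong).map(a => (s"$studentId,$a", 0))
}

// Plain Scala collection standing in for the input rows.
val rows = List(
  ("studentNo01", "a,b,c", "x,y,z"),
  ("studentNo02", "c,d", "v,w")
)

// flatMap flattens the per-row sequences into a single list of pairs.
val pairs = rows.flatMap { case (id, right, wrong) => toPairs(id, right, wrong) }

pairs.foreach(println)
```

With Spark, the same function could probably be applied to the DataFrame's rows as df.rdd.flatMap(r => toPairs(r.getString(0), r.getString(1), r.getString(2))), assuming all three columns are non-null strings. Unlike the union version, this keeps each student's Right and Wrong pairs together.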