mk_ mk_ - 5 days ago 6
Scala Question

spark2.0 dataframe collect muilt row as array by column

i have some dataframe like below, i want convert muilt row as an array if column value is the same

val data = Seq(("a","b","sum",0),("a","b","avg",2)).toDF("id1","id2","type","value2").show
+---+---+----+------+
|id1|id2|type|value2|
+---+---+----+------+
| a| b| sum| 0|
| a| b| avg| 2|
+---+---+----+------+


i want to convert it to below

+---+---+----+------+
|id1|id2|agg |value2|
+---+---+----+------+
| a| b| 0,2| 0|
+---+---+----+------+


the printSchema should be like below

root
|-- id1: string (nullable = true)
|-- id2: string (nullable = true)
|-- agg: struct (nullable = true)
| |-- sum: int (nullable = true)
| |-- dc: int (nullable = true)

Answer

You can:

import org.apache.spark.sql.functions._

val data = Seq(
  ("a","b","sum",0),("a","b","avg",2)
).toDF("id1","id2","type","value2")

val result = data.groupBy($"id1", $"id2").agg(struct(
  first(when($"type" === "sum", $"value2"), true).alias("sum"), 
  first(when($"type" === "avg", $"value2"), true).alias("avg")
).alias("agg"))

result.show

+---+---+-----+   
|id1|id2|  agg|
+---+---+-----+
|  a|  b|[0,2]|
+---+---+-----+

result.printSchema
root
 |-- id1: string (nullable = true)
 |-- id2: string (nullable = true)
 |-- agg: struct (nullable = false)
 |    |-- sum: integer (nullable = true)
 |    |-- avg: integer (nullable = true)
Comments