satyambansal117 - 1 month ago
Scala Question

Data transformation in Spark Scala

I have the following dataframe

+-----+-----+-----+-----+-----+
|item1|item2|item3| ... |itemN|
+-----+-----+-----+-----+-----+
|   v1|   v2|   v3| ... |   vN|
|   v4|   v5|   v6| ... |  v2N|
+-----+-----+-----+-----+-----+


Here item1, item2, ..., itemN are the column names, and each row holds the corresponding values v1, v2, v3, and so on.

I want to transform it into

colA   colB
item1  v1
item2  v2
item3  v3
...    ...


Here there are two columns, say colA and colB, and the rows are as shown.

How can I do this transformation in Spark using Scala?

The actual schema is as follows:

 |-- item1: struct (nullable = true)
 |    |-- l1: string (nullable = true)
 |    |-- pF: string (nullable = true)
 |    |-- pV1: string (nullable = true)
 |    |-- pV2: string (nullable = true)
 |    |-- sPs: string (nullable = true)


Here "970x250" is the column name "item1" and "level,pF,pV1,pV2,sPs" is the value "v1" corresponding to item1.

Answer

You can use explode:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._ // sqlContext.implicits._ on Spark 1.x; needed for the $"..." syntax

input.show()
// +-----+-----+-----+
// |item1|item2|item3|
// +-----+-----+-----+
// |   v1|   v2|   v3|
// |   v4|   v5|   v6|
// +-----+-----+-----+

val columns: Array[String] = input.columns

// For each row, emit one (columnName, value) tuple per column, then rename
// the generated tuple columns _1/_2 to colA/colB.
val result = input.explode(columns.map(s => col(s)): _*) {
  r: Row => columns.zipWithIndex.map { case (name, index) => (name, r.getAs[String](index)) }
}.select($"_1" as "colA", $"_2" as "colB")

result.show()
// +-----+----+
// | colA|colB|
// +-----+----+
// |item1|  v1|
// |item2|  v2|
// |item3|  v3|
// |item1|  v4|
// |item2|  v5|
// |item3|  v6|
// +-----+----+
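
Note that DataFrame.explode is deprecated as of Spark 2.0. A sketch of the same reshaping using the explode function instead (assuming every column can be cast to string): build one (colA, colB) struct per column, collect them into an array, and explode that array into one row per (column name, value) pair.

import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// One struct per column: (column name, value as string).
val pairs = array(columns.map(c =>
  struct(lit(c).as("colA"), col(c).cast("string").as("colB"))): _*)

// Explode the array so each (name, value) pair becomes its own row.
val unpivoted = input
  .select(explode(pairs).as("kv"))
  .select(col("kv.colA").as("colA"), col("kv.colB").as("colB"))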