kiran.kumar M kiran.kumar M - 2 months ago 6
Scala Question

De-normalise/Transpose spark dataframe

Give the following dataframe, How can the data in rows be converted to columns.
The list of property name is not defined. ie, there can be more properties than which are defined here. I am looking for a code sample in scala using in apache spark

UserCode | PropertyName | PropertyValue
1 | First Name | Ram
1 | Last Name | Shri
1 | Address | Ayodhya
2 | First Name | Laxman
2 | Lastname | Shri
2 | Address | Ayodhya
2 | Skill | Archery
2 | Mariatal Status | Married
2 | Age | 23
3 | First Name | Sita
3 | Last Name | Devi
3 | Address | Ayodhya


Expected output

UserCode | First Name | Last Name | Address | Skill | Age
1 | Ram | Shri | Ayodhya | |
2 | Laxman | Shri | Ayodhya | Archery | 23
3 | Sita | Devi | Ayodhya | |

Answer

If you're okay using pivot, this is quite straightforward.

val df = Seq(
(1, "First Name", "Ram"),
(1, "Last Name", "Shri"),
(1, "Address", "Ayodhya"),
(2, "First Name", "Laxman"),
(2, "Last Name", "Shri"),
(2, "Address", "Ayodhya"),
(2, "Skill", "Archery"),
(2, "Marital Status", "Married"),
(2, "Age", "23"),
(3, "First Name", "Sita"),
(3, "Last Name", "Devi"),
(3, "Address", "Ayodhya")
).toDF("userCode", "propertyName", "propertyValue")

df.groupBy("userCode").pivot("propertyName").agg(first("propertyValue")).show

+--------+-------+----+----------+---------+--------------+-------+
|userCode|Address| Age|First Name|Last Name|Marital Status|  Skill|
+--------+-------+----+----------+---------+--------------+-------+
|       1|Ayodhya|null|       Ram|     Shri|          null|   null|
|       2|Ayodhya|  23|    Laxman|     Shri|       Married|Archery|
|       3|Ayodhya|null|      Sita|     Devi|          null|   null|
+--------+-------+----+----------+---------+--------------+-------+
Comments