Darshan Mehta Darshan Mehta -4 years ago 244
Scala Question

Spark : Pivot with multiple columns

I have the following

json
file that I am reading in Spark:

{"id" : "123", "category" : "A"}
{"id" : "456", "category" : "B"}
{"id" : "567", "category" : "C"}
{"id" : "678", "category" : "D"}
{"id" : "789", "category" : "E"}
{"id" : "123", "category" : "B"}
{"id" : "123", "category" : "C"}


I want to
pivot
it to generate mxn matrix of categories. Below is my code:

val df = spark.read.json("test.json")
val pivots = df.groupBy("category").pivot("category").count()
pivots.show()


This generates the following output:

+--------+----+----+----+----+----+
|category| A| B| C| D| E|
+--------+----+----+----+----+----+
| E|null|null|null|null| 1|
| B|null| 2|null|null|null|
| D|null|null|null| 1|null|
| C|null|null| 2|null|null|
| A| 1|null|null|null|null|
+--------+----+----+----+----+----+


What I really want to do is, to pivot by
id
and show the counts in this matrix. E.g.
id
'123' is mapped to categories
A
,
B
and
C
, so I want the above matrix to show
1
value for
A-A
,
A-B
and
A-C
(and vice versa), currently, it's only showing
1
value for
A-A
.

I changed the code to
group by
'id' but that results in
id
values in first column whereas I want categories.

Is there any way I can do this (maybe using a completely different approach)?

Answer Source

Fist rename the columns and apply self-join:

val leftRight = df
  .withColumnRenamed("category", "left")
  .join(df.withColumnRenamed("category", "right"), Seq("id"))

to get co-occurrences for each id. Next apply crosstab:

leftRight.stat.crosstab("left", "right")

to aggregate data across all ids. The result is:

+----------+---+---+---+---+---+
|left_right|  A|  B|  C|  D|  E|
+----------+---+---+---+---+---+
|         E|  0|  0|  0|  0|  1|
|         A|  1|  1|  1|  0|  0|
|         B|  1|  2|  1|  0|  0|
|         C|  1|  1|  2|  0|  0|
|         D|  0|  0|  0|  1|  0|
+----------+---+---+---+---+---+
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download