Dmitry Polonskiy - 2 months ago
Python Question

Breaking down rows in a PySpark DataFrame

I have a PySpark DataFrame in this format:

dbn | bus | subway | score
----------|----------------|----------|--------
XYZ12 | B1, B44, B66 | A, C | 59
ZYY3 | B8, B3, B7 | J, Z | 66


What I want to do is attach the score column to every individual bus and subway line; however, I want to work on one column at a time, so I'll start with bus. Ultimately, this is what I want my DataFrame to look like (when I'm working with the bus column):

dbn | bus | subway | score
---------|-----------|---------|-------
XYZ12 | B1 | A, C | 59
XYZ12 | B44 | A, C | 59
XYZ12 | B66 | A, C | 59
ZYY3 | B8 | J, Z | 66
ZYY3 | B3 | J, Z | 66
ZYY3 | B7 | J, Z | 66


How would I go about doing this?

Answer

You can use the explode function, which expects an array or a map column as its input. If bus is a string, you can use string processing functions, like split, to break it into pieces first. Let's assume this scenario:

# sample data matching the question; sc is the SparkContext available in the shell
df = sc.parallelize([
    ("XYZ12", "B1, B44, B66", "A, C", 59),
    ("ZYY3 ", "B8, B3, B7", "J, Z", 66)
]).toDF(["dbn", "bus", "subway", "score"])

First import required functions:

from pyspark.sql.functions import col, explode, split, trim

then split the bus string and explode the resulting array, overwriting the bus column:

with_bus_exploded = df.withColumn("bus", explode(split("bus", ",")))

and trim leading / trailing spaces:

with_bus_trimmed = with_bus_exploded.withColumn("bus", trim(col("bus")))
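
As a side note, the pattern argument to split is a regular expression, so the split and trim steps can be folded into one by splitting on a comma followed by any whitespace. A minimal sketch of that variation (the with_bus name is just illustrative):

with_bus = df.withColumn("bus", explode(split(col("bus"), r",\s*")))

This produces the same bus column as the two-step version above.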

Finally, with_bus_trimmed.show() gives:

+-----+---+------+-----+
|  dbn|bus|subway|score|
+-----+---+------+-----+
|XYZ12| B1|  A, C|   59|
|XYZ12|B44|  A, C|   59|
|XYZ12|B66|  A, C|   59|
|ZYY3 | B8|  J, Z|   66|
|ZYY3 | B3|  J, Z|   66|
|ZYY3 | B7|  J, Z|   66|
+-----+---+------+-----+
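
If you later want the same treatment for the subway column, the identical steps apply. A sketch building on with_bus_trimmed (the with_subway name is just illustrative):

with_subway = (with_bus_trimmed
    .withColumn("subway", explode(split("subway", ",")))
    .withColumn("subway", trim(col("subway"))))

After this, each row carries a single bus and a single subway line together with its score.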