
Convert comma-separated string to array in PySpark DataFrame

I have a DataFrame, shown below, where the column ev is of type string.

>>> df2.show()
+---+--------------+
| id| ev|
+---+--------------+
| 1| 200, 201, 202|
| 1|23, 24, 34, 45|
| 1| null|
| 2| 32|
| 2| null|
+---+--------------+


Is there a way to cast ev to ArrayType without using a UDF, or is a UDF the only option?

Answer

You can use the built-in split function, which splits a string column on a regular expression and returns an array column:

from pyspark.sql.functions import col, split

df = sc.parallelize([
    (1, "200, 201, 202"), (1, "23, 24, 34, 45"), (1, None),
    (2, "32"), (2, None)]).toDF(["id", "ev"])

# Split on a comma followed by optional whitespace.
# Rows where ev is null stay null, since split passes nulls through.
df.select(col("id"), split(col("ev"), r",\s*").alias("ev")).show()
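If you want numeric elements rather than strings, you can cast the array column produced by split. A minimal sketch, assuming Spark 2.x+, where Column.cast accepts a DDL type string such as "array&lt;int&gt;":

from pyspark.sql.functions import col, split

# split yields array<string>; cast converts each element to int.
# Elements that fail to parse become null rather than raising an error.
result = df.select(
    col("id"),
    split(col("ev"), r",\s*").cast("array<int>").alias("ev"))

result.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- ev: array (nullable = true)
#  |    |-- element: integer (containsNull = true)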