user2205916 user2205916 - 1 month ago 22
Python Question

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the following, written in pseudocode for clarity:

df = df.withColumn('new_column', IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.)


I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into
expr()
but couldn't get it to work.

Note that
df
is a
pyspark.sql.dataframe.DataFrame
.

Answer

There are a few efficient ways to implement this. Let's start with required imports:

from pyspark.sql.functions import col, expr, when

You can use Hive IF function inside expr:

new_column_1 = expr(
    """IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"""
)

or when + otherwise:

new_column_2 = when(
    col("fruit1").isNull() | col("fruit2").isNull(), 3
).when(col("fruit1") == col("fruit2"), 1).otherwise(0)

Finally you could use following trick:

from pyspark.sql.functions import coalesce, lit

new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))

With example data:

df = sc.parallelize([
    ("orange", "apple"), ("kiwi", None), (None, "banana"), ("mango", "mango")
]).toDF(["fruit1", "fruit2"])

you can use this as follows:

(df
    .withColumn("new_column_1", new_column_1)
    .withColumn("new_column_2", new_column_2)
    .withColumn("new_column_3", new_column_3))

and the result is:

+------+------+------------+------------+------------+
|fruit1|fruit2|new_column_1|new_column_2|new_column_3|
+------+------+------------+------------+------------+
|orange| apple|           0|           0|           0|
|  kiwi|  null|           3|           3|           3|
|  null|banana|           3|           3|           3|
| mango| mango|           1|           1|           1|
+------+------+------------+------------+------------+
Comments