Aaron H - 5 months ago
Scala Question

Error With Multiple withColumn in Apache Spark

This line of code is not working the way I thought it would:

val df2 = df1
.withColumn("email_age", when('age_of_email <= 60, 1))
.withColumn("email_age", when('age_of_email <= 120, 2))
.withColumn("email_age", when('age_of_email <= 180, 3).otherwise(4))

I have thousands of rows in df1 with age_of_email values that are less than 60 and/or less than 120, but all my rows are getting categorized as 3 or 4.

Any insight into why this is happening?


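As people have said in the comments, calling withColumn with a column name that already exists in the DataFrame replaces that column. Your three calls therefore overwrite each other, and only the last one (3, otherwise 4) survives.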
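To make the replacement behaviour concrete, here is a minimal sketch that runs the three calls from the question next to the last call on its own. The object name `OverwriteDemo`, the helper names `sequential` and `lastOnly`, the local SparkSession settings, and the sample ages are all assumptions made for illustration. It uses `col("age_of_email")`, which should behave the same as the `'age_of_email` syntax in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object OverwriteDemo {
  // The three calls from the question: each one replaces "email_age".
  def sequential(df: DataFrame): DataFrame = df
    .withColumn("email_age", when(col("age_of_email") <= 60, 1))
    .withColumn("email_age", when(col("age_of_email") <= 120, 2))
    .withColumn("email_age", when(col("age_of_email") <= 180, 3).otherwise(4))

  // Only the last call, for comparison with the three calls above.
  def lastOnly(df: DataFrame): DataFrame =
    df.withColumn("email_age", when(col("age_of_email") <= 180, 3).otherwise(4))

  def main(args: Array[String]): Unit = {
    // Hypothetical local session, just for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("overwrite-demo")
      .getOrCreate()
    import spark.implicits._

    // Made-up ages, one per bucket the question is after.
    val df1 = Seq(30, 100, 150, 400).toDF("age_of_email")

    sequential(df1).show() // email_age: 3, 3, 3, 4
    lastOnly(df1).show()   // email_age: 3, 3, 3, 4
    spark.stop()
  }
}
```

Both DataFrames end up with the same email_age values, which matches what the question describes: everything lands in 3 or 4.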

For what you want to achieve, you can either use a different column name for each categorization or simply chain the when() calls into a single column expression, like this:

val df2 = df1.withColumn("email_age", when('age_of_email <= 60, 1)
                                     .when('age_of_email <= 120, 2)
                                     .when('age_of_email <= 180, 3)
                                     .otherwise(4))

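I guess you're aware that the conditions overlap: a value that is <= 60 is also <= 120 and <= 180. The chained when() clauses are evaluated in order and the first match wins, so the narrowest condition has to come first.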
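Because the ranges overlap, the order of the when() clauses matters. The following is a small sketch of the chained version with made-up data, assuming a local SparkSession. The names `ChainedWhenDemo` and `categorize` are hypothetical, and `col("age_of_email")` stands in for the `'age_of_email` syntax.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object ChainedWhenDemo {
  // One column expression: clauses are checked top to bottom and the
  // first one that matches decides the value.
  def categorize(df: DataFrame): DataFrame =
    df.withColumn(
      "email_age",
      when(col("age_of_email") <= 60, 1)
        .when(col("age_of_email") <= 120, 2)
        .when(col("age_of_email") <= 180, 3)
        .otherwise(4)
    )

  def main(args: Array[String]): Unit = {
    // Hypothetical local session, just for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("chained-when-demo")
      .getOrCreate()
    import spark.implicits._

    // Made-up ages, one per bucket.
    val df1 = Seq(30, 100, 150, 400).toDF("age_of_email")

    categorize(df1).show() // email_age: 1, 2, 3, 4
    spark.stop()
  }
}
```

If the `<= 180` clause were listed first, it would also catch the 30 and 100 rows, which is why the narrowest range goes at the top.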