Aaron H - 1 month ago
Scala Question

Error With Multiple withColumn in Apache Spark

This line of code is not working the way I thought it would:

val df2 = df1
.withColumn("email_age", when('age_of_email <= 60, 1))
.withColumn("email_age", when('age_of_email <= 120, 2))
.withColumn("email_age", when('age_of_email <= 180, 3).otherwise(4))


I have thousands of rows in df1 where age_of_email is less than 60 or less than 120, but every row is getting categorized as 3 or 4.

Any insight into why this is happening?

Answer

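As people have said in the comments, calling withColumn with a column name that already exists in the DataFrame replaces that column. Each of your three calls therefore overwrites the previous one, and only the last expression, when('age_of_email <= 180, 3).otherwise(4), survives. That is why every row ends up as 3 or 4.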
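To see the overwrite in action, here is a minimal sketch that runs the three calls from the question on a tiny DataFrame. The sample values, the object name OverwriteDemo, the helper lastCallWins and the local SparkSession are my own assumptions for illustration, and I use col("age_of_email") in place of 'age_of_email so the snippet needs no implicits inside the helper.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object OverwriteDemo {
  // The same three withColumn calls as in the question (hypothetical helper name).
  // Each call targets "email_age", so each one replaces the column built before it.
  def lastCallWins(df: DataFrame): DataFrame =
    df
      .withColumn("email_age", when(col("age_of_email") <= 60, 1))
      .withColumn("email_age", when(col("age_of_email") <= 120, 2))
      .withColumn("email_age", when(col("age_of_email") <= 180, 3).otherwise(4))

  def main(args: Array[String]): Unit = {
    // Local session for the demo only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("overwrite-demo")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: one value per intended bucket.
    val df1 = Seq(30, 90, 150, 400).toDF("age_of_email")

    // Only the last expression is left, so 30, 90 and 150 all get 3 and 400 gets 4.
    lastCallWins(df1).show()

    spark.stop()
  }
}
```

Note also that a when() without an otherwise() yields null for rows that match nothing, so the first two calls would leave nulls even if they were not overwritten.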

To get the result you want, you can either use a different column name for each categorization or simply chain the when() calls into a single column expression, like this:

val df2 = df1.withColumn("email_age", when('age_of_email <= 60, 1)
                                     .when('age_of_email <= 120, 2)
                                     .when('age_of_email <= 180, 3)
                                     .otherwise(4))

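Keep in mind that your conditions overlap: any value that is <= 60 is also <= 120 and <= 180, so categories 1 and 2 are subsets of category 3. A chained when() is evaluated in order and the first matching condition wins, so the narrowest condition has to come first, as it does above.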
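Because the conditions overlap, the order of the chained when() calls matters. The sketch below compares the ordering used above with a reversed one on some made-up values. The names WhenOrderDemo, narrowFirst and wideFirst, the sample data and the local SparkSession are assumptions of mine, not part of the question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object WhenOrderDemo {
  // Narrowest condition first: each row lands in the first bucket it fits.
  def narrowFirst: Column =
    when(col("age_of_email") <= 60, 1)
      .when(col("age_of_email") <= 120, 2)
      .when(col("age_of_email") <= 180, 3)
      .otherwise(4)

  // Widest condition first: anything <= 180 matches right away,
  // so the branches for 2 and 1 are never reached.
  def wideFirst: Column =
    when(col("age_of_email") <= 180, 3)
      .when(col("age_of_email") <= 120, 2)
      .when(col("age_of_email") <= 60, 1)
      .otherwise(4)

  def main(args: Array[String]): Unit = {
    // Local session for the demo only (assumption).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("when-order-demo")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: one value per intended bucket.
    val df = Seq(30, 90, 150, 400).toDF("age_of_email")
      .withColumn("narrow_first", narrowFirst)
      .withColumn("wide_first", wideFirst)

    // narrow_first is 1, 2, 3, 4; wide_first is 3, 3, 3, 4.
    df.show()

    spark.stop()
  }
}
```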
