H.Z. - 3 months ago
Scala Question

How to insert record into a dataframe in spark

I have a dataframe (df1) with 50 columns; the first is cust_id and the rest are features. I also have another dataframe (df2) that contains only cust_id. I'd like to add one record per customer in df2 to df1, with all the features set to 0. But since the two dataframes have different schemas, I cannot do a union. What is the best way to do this?

I tried a full outer join, but it generates two cust_id columns and I need only one. I'd have to merge the two cust_id columns somehow, but I don't know how.

Answer

You can get close to that with a full outer join. Passing the join key as a Seq gives you a single cust_id column instead of two:

val result = df1.join(df2, Seq("cust_id"), "full_outer")
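A minimal runnable sketch (with hypothetical sample data and a local SparkSession, both assumptions for illustration) showing that the Seq-based join keeps a single cust_id column, unlike joining on an explicit column equality expression:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

// hypothetical sample data: df1 has the key plus one feature, df2 only the key
val df1 = Seq((1, 10.0), (2, 20.0)).toDF("cust_id", "feature1")
val df2 = Seq(3, 4).toDF("cust_id")

// joining on Seq("cust_id") coalesces the key into one column
val result = df1.join(df2, Seq("cust_id"), "full_outer")
result.printSchema() // cust_id appears once; feature1 is null for ids 3 and 4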

However, the features will be null rather than 0 for the customers that exist only in df2. If you really need them to be zero, one way is to add every feature column to df2 as a literal 0 and then union the result with df1:

import org.apache.spark.sql.functions.{col, lit}

val features = df1.columns.toSet - "cust_id" // feature columns, without the key
val newDF = features.foldLeft(df2)(
  (df, colName) => df.withColumn(colName, lit(0))
)
// union (the non-deprecated form of unionAll) resolves columns by position,
// so align newDF's column order with df1's before unioning
df1.union(newDF.select(df1.columns.map(col): _*))
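If you'd rather keep the full outer join from above, a hedged alternative sketch is to replace the resulting nulls in the feature columns with zeros using DataFrameNaFunctions (na.fill); this assumes all feature columns are numeric:

val featureCols = df1.columns.filter(_ != "cust_id")

// after the join, nulls appear in the feature columns for df2-only customers;
// na.fill(0, cols) replaces them with 0 in just those columns
val result = df1.join(df2, Seq("cust_id"), "full_outer")
  .na.fill(0, featureCols)

This avoids the foldLeft/union step entirely, at the cost of touching only numeric columns (na.fill with an integer value skips non-numeric ones).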