user299791 - 11 months ago
Scala Question

How to fill missing values with values from other dataframes

I have one data frame with an ID:String column, a Type:Int column and a Name:String column.

This data frame has a lot of missing values in the Name column.

But I also have three other dataframes that contain an ID column and a Name column.

What I'd like to do is to fill the missing values in the first Dataframe with values from the others. The other dataframes do not contain all the IDs belonging to the first dataframe, plus they can also contain IDs that are not present in the first dataframe.

What is the right approach in this case? I know I can combine two DataFrames like this:

df1.join(df2, df1("ID")===df2("ID"), "left_outer")

But since I know that all entries in the first dataframe where type=2 already have a name, I'd like to restrict the join to rows where type=1.

Any idea how I can retrieve the Name values from the three DFs in order to fill the Name column in my original dataframe?

Answer Source

You can split the data, join only the subset of interest, and union everything back together:

import org.apache.spark.sql.functions.coalesce
import spark.implicits._  // enables the $"..." syntax; assumes a SparkSession named spark

df1
  // Select ones that may require filling
  .where($"type" === 1)
  // Join
  .join(df2, Seq("ID"), "left_outer")
  // Replace NULL if needed
  .select($"ID", $"Type", coalesce(df1("Name"), df2("Name")).alias("Name"))
  // Union with subset which doesn't require filling
  .union(df1.where($"type" === 2))  // Or =!= 1 as suggested by @AlbertoBonsanto

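If the type column is nullable, neither `$"type" === 1` nor `$"type" =!= 1` matches rows where type is NULL, so those rows would be dropped. Cover this case separately by adding another union with `df1.where($"type".isNull)`.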
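The question mentions three lookup dataframes, while the snippet above joins only one. A minimal sketch of one way to extend it is to fold over the lookups and coalesce after each join. The object `FillNames`, the helper `fillNames`, the intermediate column `LookupName`, and the sample data are hypothetical names introduced for illustration. The sketch assumes Spark 2.0 or later and that earlier lookups should take precedence. It uses the null-safe operator `<=>` so that rows with a NULL type are kept in the untouched subset rather than needing a separate union.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, not}

object FillNames {
  // Hypothetical helper: fills NULL Name values in `base` (Type 1 rows only)
  // from each lookup in order, so earlier lookups take precedence.
  def fillNames(base: DataFrame, lookups: Seq[DataFrame]): DataFrame = {
    // Fix the column order so the final union lines up positionally.
    val ordered = base.select(col("ID"), col("Type"), col("Name"))

    // <=> is null-safe equality: a NULL Type compares as false instead of NULL,
    // so every row lands in exactly one of the two subsets below.
    val needsFill = col("Type") <=> 1
    val toFill = ordered.where(needsFill)
    val rest = ordered.where(not(needsFill))

    val filled = lookups.foldLeft(toFill) { (acc, lookup) =>
      // Rename the lookup column to avoid an ambiguous "Name" after the join.
      // dropDuplicates keeps one arbitrary row per ID so the join cannot
      // multiply rows of the base dataframe.
      val candidates = lookup
        .select(col("ID"), col("Name").as("LookupName"))
        .dropDuplicates("ID")

      acc
        .join(candidates, Seq("ID"), "left_outer")
        .select(col("ID"), col("Type"),
          coalesce(col("Name"), col("LookupName")).as("Name"))
    }

    filled.union(rest)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("fill-names")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, for illustration only.
    val df1 = Seq[(String, Option[Int], Option[String])](
      ("a", Some(1), None),
      ("b", Some(1), Some("Bob")),
      ("c", Some(2), Some("Carol")),
      ("d", None, Some("Dan")),
      ("e", Some(1), None)
    ).toDF("ID", "Type", "Name")

    val df2 = Seq(("a", "Alice"), ("z", "Zed")).toDF("ID", "Name")
    val df3 = Seq(("a", "Other"), ("e", "Eve")).toDF("ID", "Name")
    val df4 = Seq(("q", "Quinn")).toDF("ID", "Name")

    fillNames(df1, Seq(df2, df3, df4)).show()
    spark.stop()
  }
}
```

IDs that appear only in the lookups (such as "z" and "q" in the sample data) never reach the result because every join is a left outer join from the base rows. If a lookup can hold several different names for the same ID, `dropDuplicates` picks one arbitrarily, so a deterministic rule such as an aggregation would be needed instead.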