I have one data frame with an ID:String column, a Type:Int column and a Name:String column.
This data frame has a lot of missing values in the Name column.
But I also have three other dataframes that contain an ID column and a Name column.
What I'd like to do is to fill the missing values in the first Dataframe with values from the others. The other dataframes do not contain all the IDs belonging to the first dataframe, plus they can also contain IDs that are not present in the first dataframe.
What is the right approach in this case? I Know I can combine two DFs like:
df1.join(df2, df1("ID")===df2("ID"), "left_outer")
You can split,
join the subset of interest and gather everything back:
df1 // Select ones that may require filling .where($"type" === 1) // Join .join(df2, Seq("ID"), "left_outer") // Replace NULL if needed .select($"ID", $"Type", coalesce(df1("Name"), df2("Name")).alias("Name")) // Union with subset which doesn't require filling .union(df1.where($"type" === 2)) // Or =!= 1 as suggested by @AlbertoBonsanto
type column is
nullable you should cover this scenario separately with