Maximilian Kohl Maximilian Kohl - 1 year ago 142
R Question

SparkR merge without creating duplicate columns

I decided to create this question because the answer provided to a very similar problem does not work for my scenario:

I want to merge these two SparkR dataframes df1

col1 col2
1 11
2 22
3 33
4 44


and df2

col1 col2 newcol
1 11 401
2 22 402
3 33 403
4 44 404


The dfs should be merged on columns 1 and 2. I tried all steps proposed in the link above (e.g. omitting "by" statement), but the result always looks like this:

col1_x col2_x col1_y col2_y newcol
1 11 1 11 401
2 22 2 22 402
3 33 3 33 403
4 44 4 44 404


I don't want to have duplicate columns and for now I have to delete col1_y, col2_y manually and rename col1_x, col2_x to only have col1, col2 and newcol as result:

df <- merge(df1, df2) # <- how can I change this?
df$col1_y <- NULL
df$col2_y <- NULL
df <- withColumnRenamed(df, "col1_x", "col1")
df <- withColumnRenamed(df, "col2_x", "col2")


I tried all kinds of combinations using the by, by.x and by.y statement but none worked. Is there a way I can simplify this tedious correction chain by omitting one of the steps? Thank you!

Answer Source

This is expected behaviour, one option is to select() relevant columns from the resulting DataFrame:

library(magrittr) 

# We'll use join to avoid suffixes '_x' & '_y'
join(df1,df2,(df1$col1==df2$col1)&(df1$col2==df2$col2)) %>%
  select(df1$col1,df1$col2,df2$newcol) %>%
  showDF()
#+----+----+------+
#|col1|col2|newcol|
#+----+----+------+
#|   1|11.0|   401|
#|   2|22.0|   402|
#|   3|33.0|   403|
#|   4|55.0|   404|
#+----+----+------+