Dan Goldstein Dan Goldstein - 3 months ago 13
R Question

How to join (merge) data frames (inner, outer, left, right)?

Given two data frames:

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio

df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio


How can I do database style, i.e., sql style, joins? That is, how do I get:


  • An inner join of df1 and df2:

    Return only the rows in which the left table have matching keys in the right table.

  • An outer join of df1 and df2:

    Returns all rows from both tables, join records from the left which have matching keys in the right table.

  • A left outer join (or simply left join) of df1 and df2

    Return all rows from the left table, and any rows with matching keys from the right table.

  • A right outer join of df1 and df2

    Return all rows from the right table, and any rows with matching keys from the left table.



Extra credit:

How can I do a sql style select statement?

Answer

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

Comments