Douglas Clark - 1 year ago 69
R Question

# Why does X[Y] join of data.tables not allow a full outer join, or a left join?

This is a bit of a philosophical question about data.table join syntax. I am finding more and more uses for data.tables, but still learning...

The join format

`X[Y]`
for data.tables is very concise, handy and efficient, but as far as I can tell, it only supports inner joins and right outer joins. To get a left or full outer join, I need to use
`merge`
:

• `X[Y, nomatch = NA]`
-- all rows in Y -- right outer join (default)

• `X[Y, nomatch = 0]`
-- only rows with matches in both X and Y -- inner join

• `merge(X, Y, all = TRUE)`
-- all rows from both X and Y -- full outer join

• `merge(X, Y, all.x = TRUE)`
-- all rows in X -- left outer join

It seems to me that it would be handy if the
`X[Y]`
join format supported all 4 types of joins. Is there a reason only two types of joins are supported?

For me, the
`nomatch = 0`
and
`nomatch = NA`
parameter values are not very intuitive for the actions being performed. It is easier for me to understand and remember the
`merge`
syntax:
`all = TRUE`
,
`all.x = TRUE`
and
`all.y = TRUE`
. Since the
`X[Y]`
operation resembles
`merge`
much more than
`match`
, why not use the
`merge`
syntax for joins rather than the
`match`
function's
`nomatch`
parameter?

Here are code examples of the 4 join types:

``````# sample X and Y data.tables
library(data.table)
X <- data.table(t = 1:4, a = (1:4)^2)
setkey(X, t)
X
#    t  a
# 1: 1  1
# 2: 2  4
# 3: 3  9
# 4: 4 16

Y <- data.table(t = 3:6, b = (3:6)^2)
setkey(Y, t)
Y
#    t  b
# 1: 3  9
# 2: 4 16
# 3: 5 25
# 4: 6 36

# all rows from Y - right outer join
X[Y]  # default
#  t  a  b
# 1: 3  9  9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36

X[Y, nomatch = NA]  # same as above
#    t  a  b
# 1: 3  9  9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36

merge(X, Y, by = "t", all.y = TRUE)  # same as above
#    t  a  b
# 1: 3  9  9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36

identical(X[Y], merge(X, Y, by = "t", all.y = TRUE))
# [1] TRUE

# only rows in both X and Y - inner join
X[Y, nomatch = 0]
#    t  a  b
# 1: 3  9  9
# 2: 4 16 16

merge(X, Y, by = "t")  # same as above
#    t  a  b
# 1: 3  9  9
# 2: 4 16 16

merge(X, Y, by = "t", all = FALSE)  # same as above
#    t  a  b
# 1: 3  9  9
# 2: 4 16 16

identical( X[Y, nomatch = 0], merge(X, Y, by = "t", all = FALSE) )
# [1] TRUE

# all rows from X - left outer join
merge(X, Y, by = "t", all.x = TRUE)
#    t  a  b
# 1: 1  1 NA
# 2: 2  4 NA
# 3: 3  9  9
# 4: 4 16 16

# all rows from both X and Y - full outer join
merge(X, Y, by = "t", all = TRUE)
#    t  a  b
# 1: 1  1 NA
# 2: 2  4 NA
# 3: 3  9  9
# 4: 4 16 16
# 5: 5 NA 25
# 6: 6 NA 36
``````

Update: data.table v1.9.6 introduced the
`on=`
syntax, which allows ad hoc joins on fields other than the primary key. jangorecki's answer to the question How to join (merge) data frames (inner, outer, left, right)? provides some examples of additional join types that data.table can handle.

To quote from the data.table FAQ 1.12

## 1.12

What is the difference between X[Y] and merge(X,Y)?

• `X[Y]` is a join, looking up X's rows using Y (or Y's key if it has one) as an index.
• `Y[X]` is a join, looking up Y's rows using X (or X's key if it has one)
• `merge(X,Y)` does both ways at the same time.

The number of rows of `X[Y]` and `Y[X]` usually differ; whereas the number of rows returned by `merge(X,Y)` and `merge(Y,X)` is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest `merge(X[,ColsNeeded1],Y[,ColsNeeded2])`, but that takes copies of the sub- sets of data, and it requires the programmer to work out which columns are needed. `X[Y,j`] in data.table does all that in one step for you. When you write `X[Y,sum(foo*bar)]`, data.table automatically inspects the j expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't `X[Y,sum(foo*bar)]` quicker to program and quicker to run than a merge followed by a subset?

If you want a left outer join of `X[Y]`

``````le <- Y[X]
mallx <- merge(X, Y, all.x = T)
# the column order is different so change to be the same as `merge`
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE
``````

If you want a full outer join

``````# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
##   t  b  a
## 1: 1 NA  1
## 2: 2 NA  4
## 3: 3  9  9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA

# The following will give the same with the column order X,Y
X[Y[J(unique_keys)]]
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download