 Roger -4 years ago 239
R Question

# Get row indices of data frame A according to multiple matching criteria in that data frame and another data frame, B

Let's say we have two data frames in R,

`df.A`
and
`df.B`
, defined thus:

``````bin_name <- c('bin_1','bin_2','bin_3','bin_4','bin_5')
bin_min <- c(0,2,4,6,8)
bin_max <- c(2,4,6,8,10)
df.A <- data.frame(bin_name, bin_min, bin_max, stringsAsFactors = FALSE)

obs_ID <- c('obs_1','obs_2','obs_3','obs_4','obs_5','obs_6','obs_7','obs_8','obs_9','obs_10')
obs_min <- c(6.5,0,8,2,1,7,5,6,8,3)
obs_max <- c(7,3,10,3,9,8,5.5,8,10,4)
df.B <- data.frame(obs_ID, obs_min, obs_max, stringsAsFactors = FALSE)
``````

`df.A`
defines the ranges of bins, while
`df.B`
consists of rows of observations with min and max values that may or may not fall entirely within a bin defined in
`df.A`
.

We want to generate a new vector of length
`nrow(df.B)`
containing the row indices of
`df.A`
corresponding to the bin in which each observation falls entirely. If an observation straddles a bin falls or partially outside it, then it can't be assigned to a bin and should return
`NA`
(or something similar).

In the above example, the correct output vector would be this:

``````bin_rows <- c(4, NA, 5, 2, NA, 4, 3, 4, 5, 2)
``````

I came up with a long-winded solution using
`sapply`
:

``````bin_assignments <- sapply(1:nrow(df.B), function(i) which(df.A\$bin_max >= df.B\$obs_max[i] & df.A\$bin_min <= df.B\$obs_min[i])) #get bin assignments for every observation
bin_assignments[bin_assignments == "integer(0)"] <- NA #replace "integer(0)" entries with NA
bin_assignments <- do.call("c", bin_assignments) #concatenate the output of the sapply call
``````

Several months ago I discovered a simple, single-line solution to this problem that didn't use an apply function. However, I forgot how I did this and I have not been able to rediscover it! The solution might involve
`match()`
or
`which()`
. Any ideas? G. Grothendieck

1) Using SQL it can readily be done in one statement:

``````library(sqldf)

sqldf('select a.rowid
from "df.B" b
left join "df.A" a on obs_min >= bin_min and obs_max <= bin_max')

rowid
1      4
2     NA
3      5
4      2
5     NA
6      4
7      3
8      4
9      5
10     2
``````

2) merge/by We can do it in two statements using `merge` and `by`. No packages are used.

This does have the downside that it materializes the large join which the SQL solution would not need to do.

Note that `df.B`, as defined in the question, has `obs_10` is the second level rather than the 10th level. If it were such that `obs_10` were the 10th level then the second argument to `by` could have been just `m\$obs_ID` so fixing up the input first could simplify it.

``````m <- merge(df.B, df.A)
stack(by(m, as.numeric(sub(".*_", "", m\$obs_ID)),
with, c(which(obs_min >= bin_min & obs_max <= bin_max), NA)))
``````

giving:

``````   values ind
1       4   1
2      NA   2
3       5   3
4       2   4
5      NA   5
6       4   6
7       3   7
8       4   8
9       5   9
10      2  10
``````

3) sapply Note that using the `c(..., NA)` trick from (2) we can simplify the `sapply` solution in the quesiton to one statement:

``````sapply(1:nrow(df.B), function(i)
c(which(df.A\$bin_max >= df.B\$obs_max[i] & df.A\$bin_min <= df.B\$obs_min[i]), NA))
``````

giving:

``````  4 NA  5  2 NA  4  3  4  5  2
``````

4) outer Here is another one statement solution that uses no packages.

``````seq_len(nrow(df.A)) %*%
(outer(df.A\$bin_max, df.B\$obs_max, ">=") & outer(df.A\$bin_min, df.B\$obs_min, "<="))
``````

giving:

``````     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    0    5    2    0    4    3    4    5     2
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download