Roger Roger - 1 month ago 12
R Question

Get row indices of data frame A according to multiple matching criteria in that data frame and another data frame, B

Let's say we have two data frames in R,

df.A
and
df.B
, defined thus:

bin_name <- c('bin_1','bin_2','bin_3','bin_4','bin_5')
bin_min <- c(0,2,4,6,8)
bin_max <- c(2,4,6,8,10)
df.A <- data.frame(bin_name, bin_min, bin_max, stringsAsFactors = FALSE)

obs_ID <- c('obs_1','obs_2','obs_3','obs_4','obs_5','obs_6','obs_7','obs_8','obs_9','obs_10')
obs_min <- c(6.5,0,8,2,1,7,5,6,8,3)
obs_max <- c(7,3,10,3,9,8,5.5,8,10,4)
df.B <- data.frame(obs_ID, obs_min, obs_max, stringsAsFactors = FALSE)


df.A
defines the ranges of bins, while
df.B
consists of rows of observations with min and max values that may or may not fall entirely within a bin defined in
df.A
.

We want to generate a new vector of length
nrow(df.B)
containing the row indices of
df.A
corresponding to the bin in which each observation falls entirely. If an observation straddles a bin falls or partially outside it, then it can't be assigned to a bin and should return
NA
(or something similar).

In the above example, the correct output vector would be this:

bin_rows <- c(4, NA, 5, 2, NA, 4, 3, 4, 5, 2)


I came up with a long-winded solution using
sapply
:

bin_assignments <- sapply(1:nrow(df.B), function(i) which(df.A$bin_max >= df.B$obs_max[i] & df.A$bin_min <= df.B$obs_min[i])) #get bin assignments for every observation
bin_assignments[bin_assignments == "integer(0)"] <- NA #replace "integer(0)" entries with NA
bin_assignments <- do.call("c", bin_assignments) #concatenate the output of the sapply call


Several months ago I discovered a simple, single-line solution to this problem that didn't use an apply function. However, I forgot how I did this and I have not been able to rediscover it! The solution might involve
match()
or
which()
. Any ideas?

Answer

1) Using SQL it can readily be done in one statement:

library(sqldf)

sqldf('select a.rowid
       from "df.B" b 
       left join "df.A" a on obs_min >= bin_min and obs_max <= bin_max')

   rowid
1      4
2     NA
3      5
4      2
5     NA
6      4
7      3
8      4
9      5
10     2

2) merge/by We can do it in two statements using merge and by. No packages are used.

This does have the downside that it materializes the large join which the SQL solution would not need to do.

Note that df.B, as defined in the question, has obs_10 is the second level rather than the 10th level. If it were such that obs_10 were the 10th level then the second argument to by could have been just m$obs_ID so fixing up the input first could simplify it.

m <- merge(df.B, df.A)
stack(by(m, as.numeric(sub(".*_", "", m$obs_ID)), 
      with, c(which(obs_min >= bin_min & obs_max <= bin_max), NA)[1]))

giving:

   values ind
1       4   1
2      NA   2
3       5   3
4       2   4
5      NA   5
6       4   6
7       3   7
8       4   8
9       5   9
10      2  10

3) sapply Note that using the c(..., NA)[1] trick from (2) we can simplify the sapply solution in the quesiton to one statement:

sapply(1:nrow(df.B), function(i)
  c(which(df.A$bin_max >= df.B$obs_max[i] & df.A$bin_min <= df.B$obs_min[i]), NA)[1]) 

giving:

[1]  4 NA  5  2 NA  4  3  4  5  2

4) outer Here is another one statement solution that uses no packages.

seq_len(nrow(df.A)) %*% 
  (outer(df.A$bin_max, df.B$obs_max, ">=") & outer(df.A$bin_min, df.B$obs_min, "<="))

giving:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    0    5    2    0    4    3    4    5     2