user85727 user85727 - 2 months ago 7
R Question

Selecting factors based upon distance between two rows in a column

I am trying to select the ids from the data frame below that have values of 2 in the num column that are at least 2 rows apart. In other words, given the following data frame:

df2 <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
num=c(1,2,1,1,2,1,1,1,2,2,1,1,1,2,2))
df2$id <- as.factor(df2$id)


How can I select the id variables such that values of 2 are separated by at least one row?

The result of the procedure I am seeking would select id 1, since the values of 2 are separated by at least one row (2 rows in this example). Thanks.

Also, the solution below doesn't seem to work on the following example:

df <- data.frame(
id=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4),
num=c(1,2,1, 1,2,2, 1,1,1,2,2,1, 1,1,1,2,2,1, 1,2,1,2,2,2)
)
df$id<-as.factor(df$id)


Again, the result should be only id 1. I just want to find, per ID, any instances where the number 2 is separated by another 2 by one or more rows.

Answer

data.table I would use the data.table package:

library(data.table)
setDT(df)

df[num == 2, .N > 1L && any(diff(.I) > 2L), by=id]

   id    V1
1:  1  TRUE
2:  2 FALSE
3:  3 FALSE

How it works: The syntax is DT[i, j, by]

  • i subsets rows
  • then by groups the remaining rows
  • then j is computed

Inside j we have .I, which stores row number for convenience; and .N, the number of rows in the by group. So to extract the ids have V1 == TRUE, we can do res[V1 == TRUE, id]. Alternately, the whole operation can be chained like

df[num == 2, .N > 1L && any(diff(.I) > 2L), by=id][V1 == TRUE, id]

[1] 1
Levels: 1 2 3

base or dplyr The (almost) analogue in base R is

sapply(split(df$num == 2, df$id), function(x){
  w = which(x)
  length(w) > 1L && any(diff(w) > 2L) 
})

    1     2     3 
 TRUE FALSE FALSE 

The latter also works in dplyr

library(dplyr)

df %>% group_by(id) %>% summarise(x = {
  w = which(num == 2)
  length(w) > 1L && any(diff(w) > 2L) 
})

# A tibble: 3 x 2
      id     x
  <fctr> <lgl>
1      1  TRUE
2      2 FALSE
3      3 FALSE