andrasz andrasz - 5 days ago 7
R Question

subset where there are at least five consecutive years in a data.frame column

I have a data.frame / data.table in R as follows:

df <- data.frame(
ID=c(rep("A", 20)),
year=c(1968, 1971, 1972, 1973, 1974, 1976, 1978, 1980, 1982, 1984, 1985,
1986, 1987, 1988, 1990, 1991, 1992, 1993, 1994, 1995))


I'd like to subset the df in order to keep only those entries which have at least five consecutive years. In this example this is the case in two periods (1984:1988 and 1990:1995). How can I do this in R? Thanks for any support.

Answer

A compact solution using diff and cumsum:

setDT(df)[, grp := cumsum(c(0, diff(year)) > 1), by = ID
          ][, if (.N > 4) .SD, by = grp][, grp := NULL][]

which gives the desired result:

    ID year
 1:  A 1984
 2:  A 1985
 3:  A 1986
 4:  A 1987
 5:  A 1988
 6:  A 1990
 7:  A 1991
 8:  A 1992
 9:  A 1993
10:  A 1994
11:  A 1995

A compareble approach in base R:

i <- with(df, ave(year, ID, FUN = function(x) { 
  r <- rle(cumsum(c(0, diff(year)) > 1));
  rep(r$lengths, r$lengths)
  } ))

df[i > 4,] # or df[which(i > 4),]

which will get you the same result.

Comments