HansHupe HansHupe - 2 months ago 6
R Question

Grouping of R dataframe by connected values

I didn't find a solution for this common grouping problem in R:

This is my original dataset

ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C


This should be my grouped resulting dataset

State min(ID) max(ID)
A 1 2
B 3 5
A 6 8
C 9 10


So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.

Any ideas?

Answer

You could try:

library(dplyr)
df %>%
  mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
  group_by(rleid) %>%
  summarise(State = first(State), min = min(ID), max = max(ID)) %>%
  select(-rleid)

Which gives:

## A tibble: 4 × 3
#   State   min   max
#  <fctr> <int> <int>
#1      A     1     2
#2      B     3     5
#3      A     6     8
#4      C     9    10
Comments