HansHupe - 1 year ago 52

R Question

I didn't find a solution for this common grouping problem in R:

This is my original dataset

`ID State`

1 A

2 A

3 B

4 B

5 B

6 A

7 A

8 A

9 C

10 C

This should be my grouped resulting dataset

`State min(ID) max(ID)`

A 1 2

B 3 5

A 6 8

C 9 10

So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.

Any ideas?

Answer Source

You could try:

```
library(dplyr)
df %>%
mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
group_by(rleid) %>%
summarise(State = first(State), min = min(ID), max = max(ID)) %>%
select(-rleid)
```

Which gives:

```
## A tibble: 4 × 3
# State min max
# <fctr> <int> <int>
#1 A 1 2
#2 B 3 5
#3 A 6 8
#4 C 9 10
```