emehex emehex - 3 months ago 9
R Question

Filter on the first (min) date

My data looks approximately like this:

Snap Date ID Stage
1 2014-01-01 A1 One
2 2014-01-02 A1 One
3 2014-01-03 A1 One
4 2014-01-04 A1 Two
5 2014-01-05 A1 Two
6 2014-01-01 B9 One
7 2014-01-02 B9 One
8 2014-01-03 B9 Two
9 2014-01-04 B9 Three


How can I filter the entries where
Stage
actually changes and remove everything else in between.

Desired Output:

Snap Date ID Stage
1 2014-01-01 A1 One
4 2014-01-04 A1 Two
6 2014-01-01 B9 One
8 2014-01-03 B9 Two
9 2014-01-04 B9 Three


Additionally, how might the solution change if there were multiple columns to filter on?

Snap Date ID Stage Colour
1 2014-01-01 A1 One Red
2 2014-01-02 A1 One Red
3 2014-01-03 A1 One Green
4 2014-01-04 A1 One Green
5 2014-01-05 A1 Two Green
6 2014-01-06 A1 Two Green
7 2014-01-07 A1 Two Blue
8 2014-01-08 A1 Two Blue
9 2014-01-09 A1 Three Blue
10 2014-01-10 A1 Three Blue
11 2014-01-11 A1 Four Blue
12 2014-01-12 A1 Four Blue
13 2014-01-13 A1 Four Blue
14 2014-01-14 A1 Four Blue
15 2014-01-15 A1 Four Blue
16 2014-01-04 B9 One Green
17 2014-01-05 B9 One Green
18 2014-01-06 B9 Two Green
19 2014-01-07 B9 Three Green

Answer

You can use data.tables unique function and its by attribute which you can update at will.

For the original question

library(data.table)
unique(setDT(df), by = c("ID", "Stage"))
#    Snap       Date ID Stage
# 1:    1 2014-01-01 A1   One
# 2:    4 2014-01-04 A1   Two
# 3:    6 2014-01-01 B9   One
# 4:    8 2014-01-03 B9   Two
# 5:    9 2014-01-04 B9 Three

For the Edit3: Just color to the by parameter

unique(df, by = c("ID", "Stage", "Colour"))
#    Snap       Date ID Stage Colour
# 1:    1 2014-01-01 A1   One    Red
# 2:    3 2014-01-03 A1   One  Green
# 3:    5 2014-01-05 A1   Two  Green
# 4:    7 2014-01-07 A1   Two   Blue
# 5:    9 2014-01-09 A1 Three   Blue
# 6:   11 2014-01-11 A1  Four   Blue
# 7:   16 2014-01-04 B9   One  Green
# 8:   18 2014-01-06 B9   Two  Green
# 9:   19 2014-01-07 B9 Three  Green

Additional option is using which.min (like you mentioned)

df[, .SD[which.min(Date)], .(ID, Stage, Colour)]

Or using dplyr

library(dplyr)
distinct(df, ID, Stage, Colour)
Comments