emehex - 2 years ago 81
R Question

# Filter on the first (min) date

My data looks approximately like this:

``````  Snap Date   ID   Stage
1 2014-01-01  A1   One
2 2014-01-02  A1   One
3 2014-01-03  A1   One
4 2014-01-04  A1   Two
5 2014-01-05  A1   Two
6 2014-01-01  B9   One
7 2014-01-02  B9   One
8 2014-01-03  B9   Two
9 2014-01-04  B9   Three
``````

How can I filter the entries where
`Stage`
actually changes and remove everything else in between.

Desired Output:

``````  Snap Date   ID   Stage
1 2014-01-01  A1   One
4 2014-01-04  A1   Two
6 2014-01-01  B9   One
8 2014-01-03  B9   Two
9 2014-01-04  B9   Three
``````

Additionally, how might the solution change if there were multiple columns to filter on?

``````   Snap Date   ID   Stage  Colour
1  2014-01-01  A1   One    Red
2  2014-01-02  A1   One    Red
3  2014-01-03  A1   One    Green
4  2014-01-04  A1   One    Green
5  2014-01-05  A1   Two    Green
6  2014-01-06  A1   Two    Green
7  2014-01-07  A1   Two    Blue
8  2014-01-08  A1   Two    Blue
9  2014-01-09  A1   Three  Blue
10 2014-01-10  A1   Three  Blue
11 2014-01-11  A1   Four   Blue
12 2014-01-12  A1   Four   Blue
13 2014-01-13  A1   Four   Blue
14 2014-01-14  A1   Four   Blue
15 2014-01-15  A1   Four   Blue
16 2014-01-04  B9   One    Green
17 2014-01-05  B9   One    Green
18 2014-01-06  B9   Two    Green
19 2014-01-07  B9   Three  Green
``````

You can use data.tables `unique` function and its `by` attribute which you can update at will.

For the original question

``````library(data.table)
unique(setDT(df), by = c("ID", "Stage"))
#    Snap       Date ID Stage
# 1:    1 2014-01-01 A1   One
# 2:    4 2014-01-04 A1   Two
# 3:    6 2014-01-01 B9   One
# 4:    8 2014-01-03 B9   Two
# 5:    9 2014-01-04 B9 Three
``````

For the Edit3: Just `color` to the `by` parameter

``````unique(df, by = c("ID", "Stage", "Colour"))
#    Snap       Date ID Stage Colour
# 1:    1 2014-01-01 A1   One    Red
# 2:    3 2014-01-03 A1   One  Green
# 3:    5 2014-01-05 A1   Two  Green
# 4:    7 2014-01-07 A1   Two   Blue
# 5:    9 2014-01-09 A1 Three   Blue
# 6:   11 2014-01-11 A1  Four   Blue
# 7:   16 2014-01-04 B9   One  Green
# 8:   18 2014-01-06 B9   Two  Green
# 9:   19 2014-01-07 B9 Three  Green
``````

Additional option is using `which.min` (like you mentioned)

``````df[, .SD[which.min(Date)], .(ID, Stage, Colour)]
``````

Or using `dplyr`

``````library(dplyr)
distinct(df, ID, Stage, Colour)
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download