aju_k aju_k - 3 months ago 33
R Question

R delete last row in dataframe for each group

I would like to delete the last row in a dataframe for each group in R based on max(start_date).

Example data:

id start_date end_date
1 2016-01-14 2016-02-14
1 2016-03-14 2016-08-05
2 2014-01-14 2014-02-14
2 2015-03-21 2015-05-21
2 2015-08-23 2015-09-23
2 2015-11-21 2016-01-03


Result:

id start_date end_date
1 2016-01-14 2016-02-14
2 2014-01-14 2014-02-14
2 2015-03-21 2015-05-21
2 2015-08-23 2015-09-23


The following does not work:

df <- df %>%
group_by(id) %>%
summarise(start_date != max(start_date))


Error: found duplicated column name: id

df <- sqldf("select * from df group by id having start_date != max(start_date)")


error in statement: duplicate column name: id

Any suggestions would be great.

Answer

We can use slice (assuming that the dates are already ordered)

df1 %>% 
   group_by(id) %>% 
   slice(-n())
#     id start_date   end_date
#   <int>      <chr>      <chr>
#1     1 2016-01-14 2016-02-14
#2     2 2014-01-14 2014-02-14
#3     2 2015-03-21 2015-05-21
#4     2 2015-08-23 2015-09-23

If the dates are not ordered, then arrange and slice

df1 %>%
   group_by(id) %>%
   arrange(start_date) %>%
   slice(-n()) 

Based on some previous benchmarks (couldn't find the link), the arrange/slice method would be faster than comparing start_date != max(start_date)