ecolog ecolog - 2 months ago 7
R Question

r - Create a sequence number for each row within a category (defined by 2+ fields) in a dataframe

I want to generate an id number within each group/subset of a dataframe Where each group is defined by two fields or more. In this test dataset I want to use "personid" and "date" as my category:

personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23


I wish to add an id column with a value for each unique combination of the two column"personid" and "date", always starting with 1. This is my desired output:

personid date measurement id
1 x 23 1
1 x 32 1
2 y 21 1
3 x 23 1
3 z 23 2
3 y 23 3


This is a similar question to the 3 year old version
Create a unique, sequential number for each row within each group/subset of a dataframe but after many attempts I wasn't able to extend their logic for my 2+fields category definition. Thanks!

Answer

Same idea as @Procrastinatus Maximus's rleid, here is a dplyr version of it:

library(dplyr)
df %>% 
      arrange(personid, date) %>% 
      group_by(personid) %>% 
      mutate(id = cumsum(date != lag(date, default = first(date))) + 1)
      # +1 converts the zero based id to one based id here

# Source: local data frame [6 x 4]
# Groups: personid [3]
#
#   personid   date measurement    id
#      <int> <fctr>       <int> <dbl>
# 1        1      x          23     1
# 2        1      x          32     1
# 3        2      y          21     1
# 4        3      x          23     1
# 5        3      y          23     2
# 6        3      z          23     3

In order for rleid or cumsum to work here, we have to sort the data frame by personid and then date since both methods only care about adjacent values.