Saikar Saikar - 3 months ago 9
R Question

Using data.table to identify all event occurence with condition of picking first occurence if in sequence

I am trying to identify all occurrence of an event and if repeating in sequence pick the first occurrence. I am able to tag and add a count but not able to reset the count after there is change in event.

My data has ~1M rows with 30 odd IDs. I have added only one ID but have 30 odd IDs in my data. The table has an ID, datetime and status.

Status is the event which can have multiple values -A, B, C... The event I am concerned is for B.

I want to add three columns -

Occurrence_B - Flag for event is B

Count_B - Counts the consecutive occurrence for event=B with reset when the event changes

Include_B - Flag to show if that specific occurrence is the first or continued occurrence

I will subset the data where Include_B='new' to pick the first occurrence among a sequence.

ID Date Status Occurrence_B Count_B Include_B

A 7/28/15 12:00 AM A 0 0 0

A 7/28/15 12:30 AM A 0 0 0

A 7/30/15 12:00 AM B 1 1 new

A 7/31/15 12:00 AM B 1 2 continued

A 7/31/15 11:00 AM B 1 3 continued

A 8/2/15 10:00 AM B 0 0 0

A 8/3/15 12:00 AM C 0 0 0

A 8/4/15 12:00 AM B 1 1 new

A 8/5/15 12:00 AM B 1 2 continued

A 8/6/15 12:00 AM A 1 0 continued

A 8/7/15 12:00 AM B 1 1 new


table_picture

My sample code--

d1[, Occurrence_B:=Status %in% c('B')+0L]

d1[, Count_B := cumsum(Occurrence_B), by=.(ID,Status)]


Issue is I don't know how to reset count_B once event changes. I am trying to investigate but I am new to data.table so will greatly appreciate any help.

Please let me know if you have any questions.

SK

Answer

You may try something like this:

# create Occurrence_B column and initialize Include_B as NA
(d1[, `:=` (Occurrence_B = as.integer(Status == "B"), Include_B = NA_character_)]

  # calculate Count_B use rleid(Occurrence_B) as group variable which will group consecutive
  # same values together
  [, Count_B := cumsum(Occurrence_B), by = rleid(Occurrence_B)]

  # Update the Include_B variable in place based on Count_B, when Count_B == 1, it appears 
  # the first time, when Count_B > 1, it is continued, otherwise keep them as NA
  [Count_B == 1, Include_B := "new"][Count_B > 1, Include_B := "continued"][])

# ID                Date Status Occurrence_B Count_B Include_B
# 1:  A 7/28/15 12:00 AM      A            0       0        NA
# 2:  A 7/28/15 12:30 AM      A            0       0        NA
# 3:  A 7/30/15 12:00 AM      B            1       1       new
# 4:  A 7/31/15 12:00 AM      B            1       2 continued
# 5:  A 7/31/15 11:00 AM      B            1       3 continued
# 6:  A  8/2/15 10:00 AM      B            1       4 continued
# 7:  A  8/3/15 12:00 AM      C            0       0        NA
# 8:  A  8/4/15 12:00 AM      B            1       1       new
# 9:  A  8/5/15 12:00 AM      B            1       2 continued
#10:  A  8/6/15 12:00 AM      A            0       0        NA
#11:  A  8/7/15 12:00 AM      B            1       1       new