James Marquez James Marquez - 2 months ago 8
R Question

How to Perform Consecutive Counts of Column by Group Conditionally Upon Another Column

I'm trying to get consecutive counts from the Noshow column grouped by the PatientID column. The below code that I am using is very close to the results that I wish to attain. However, using the sum function returns the sum of the whole group. I would like the sum function to only sum the current row and only the rows that have a '1' above it. Basically, I'm trying to count the consecutive amount of times a patient noshows their appointment for each row and then reset to 0 when they do show. It seems like only some tweaks need to be made to my below code. However, I cannot seem to find the answer anywhere on this site.

transform(df, ConsecNoshows = ifelse(Noshow == 0, 0, ave(Noshow, PatientID, FUN = sum)))


The above code produces the below output:

#Source: local data frame [12 x 3]
#Groups: ID [2]
#
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 4
#3 1 0 0
#4 1 1 4
#5 1 1 4
#6 1 1 4
#7 2 0 0
#8 2 0 0
#9 2 1 3
#10 2 1 3
#11 2 0 0
#12 2 1 3


This is what I desire:

#Source: local data frame [12 x 3]
#Groups: ID [2]
#
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 0
#3 1 0 1
#4 1 1 0
#5 1 1 1
#6 1 1 2
#7 2 0 0
#8 2 0 0
#9 2 1 0
#10 2 1 1
#11 2 0 2
#12 2 1 0


[UPDATE] I would like the consecutive count to be offset by one row down.

Thank you for any help you can offer in advance!

Answer

And here's another (similar) data.table approach

library(data.table)
setDT(df)[, ConsecNoshows := seq(.N) * Noshow, by = .(PatientID, rleid(Noshow))]
df
#     PatientID Noshow ConsecNoshows
#  1:         1      0             0
#  2:         1      1             1
#  3:         1      0             0
#  4:         1      1             1
#  5:         1      1             2
#  6:         1      1             3
#  7:         2      0             0
#  8:         2      0             0
#  9:         2      1             1
# 10:         2      1             2
# 11:         2      0             0
# 12:         2      1             1

This is basically groups by PatientID and "run-length-encoding" of Noshow and creates sequences using the group sizes while multiplying by Noshow in order to keep only the values when Noshow == 1