Michael MacAskill Michael MacAskill - 1 month ago 8x
R Question

'Forward' cumulative sum in dplyr

When examining datasets from longitudinal studies, I commonly get results like this from a

analysis chain from the raw data:

df = data.frame(n_sessions=c(1,2,3,4,5), n_people=c(59,89,30,23,4))

i.e. a count of how many participants have completed a certain number of assessments at this point in time.

Although it is useful to know how many people have completed exactly n sessions, we more often need to know how many have completed at least n sessions. As per the table below, a standard cumulative sum isn't appropriate, What we want are the values in the
column, which is a sort of "forwards cumulative sum" of the values in the
column. i.e. the value in each row should be the sum of the values of itself and all values above it, rather than the standard cumulative sum, which is the sum of all values up to and including itself:

n_sessions n_people n_total cumsum
1 59 205 59
2 89 146 148
3 30 57 178
4 23 27 201
5 4 4 205

Generating the cumulative sum is simple:

mutate(df, cumsum = cumsum(n_people))

What would be an expression for generating a "forwards cumulative sum" that could be incorporated in a
analysis chain? I'm guessing that
would need to be applied to
after sorting by
descending, but can't quite get my head around how to get the answer while preserving the original order of the data frame.


You can take a cumulative sum of the reversed vector, then reverse that result. The built-in rev function is helpful here:

mutate(df, rev_cumsum = rev(cumsum(rev(n_people))))

For example, on your data this returns:

  n_sessions n_people rev_cumsum
1          1       59        205
2          2       89        146
3          3       30         57
4          4       23         27
5          5        4          4