Michael MacAskill Michael MacAskill - 3 months ago 18
R Question

'Forward' cumulative sum in dplyr

When examining datasets from longitudinal studies, I commonly get results like this from a

dplyr
analysis chain from the raw data:

df = data.frame(n_sessions=c(1,2,3,4,5), n_people=c(59,89,30,23,4))


i.e. a count of how many participants have completed a certain number of assessments at this point in time.

Although it is useful to know how many people have completed exactly n sessions, we more often need to know how many have completed at least n sessions. As per the table below, a standard cumulative sum isn't appropriate, What we want are the values in the
n_total
column, which is a sort of "forwards cumulative sum" of the values in the
n_people
column. i.e. the value in each row should be the sum of the values of itself and all values above it, rather than the standard cumulative sum, which is the sum of all values up to and including itself:

n_sessions n_people n_total cumsum
1 59 205 59
2 89 146 148
3 30 57 178
4 23 27 201
5 4 4 205


Generating the cumulative sum is simple:

mutate(df, cumsum = cumsum(n_people))


What would be an expression for generating a "forwards cumulative sum" that could be incorporated in a
dplyr
analysis chain? I'm guessing that
cumsum
would need to be applied to
n_people
after sorting by
n_sessions
descending, but can't quite get my head around how to get the answer while preserving the original order of the data frame.

Answer

You can take a cumulative sum of the reversed vector, then reverse that result. The built-in rev function is helpful here:

mutate(df, rev_cumsum = rev(cumsum(rev(n_people))))

For example, on your data this returns:

  n_sessions n_people rev_cumsum
1          1       59        205
2          2       89        146
3          3       30         57
4          4       23         27
5          5        4          4