user6571411 user6571411 - 1 month ago 12
R Question

Calculating cumulative sum of event over time

I have a dataframe

df
which contains data on the date of vaccination of individuals with unique identification numbers. An individual is considered vaccinated if he has received three vaccinations by the age of two years old. My aim is to calculate the cumulative sum of fully vaccinated individuals with the end goal of plotting the proportion of the under population under three that is fully vaccinated and any given time
x
. In my mind I've come up with the perfect code but my intuition is failing for some reason and I get a weird increase during the end of the time period. See below.

After considerable data wrangling we start the example data with dataframe
df
where every row is a single vaccination event and a single column dataframe
date
that contains every single date in the time period of interest.

glimpse(df)
Observations: 50,469
Variables: 6
$ id <chr> "1000038", "1000038", "1000038", "1000128", "1000380",...
$ n_max <int> 3, 1, 1, 3, 3, 3, 3,... ###total num times before 2 years old
$ age_y <int> 0, 0, 0, 0, 1, 0, 0,... ###current age for this observation
$ age_m <int> 3, 5, 11, 3, 4,... ###current age in months for this obs
$ date_vacc <date> 2013-05-08, 2013-07-03, 2014-01-13,... ###current date obs
$ year <dbl> 2013, 2013, 2014, 2013,... ###current year of obs

glimpse(date)
Observations: 4,017
Variables: 1
$ date_vacc <date> 2005-01-01, 2005-01-02, 2005-01-03, 2005-01-04, 2005-01-05, 2005-01-06, 2005-01-07, 2005-01-08, 2005-01-09, 20...


Now I exploit the structure of
df
as to change what each row 'means'. At this point each row represents an observation of a single vaccine event, and the following code, first
i)
makes each row represent the date at which the individual received his last vaccine dose, regardless if he received 1, 2 or three in total. Then
ii)
it changes the meaning of the row to represent the number of individuals who received their last dose on a given date.

df <-
df[!duplicated(dfid, fromLast = TRUE),] %>% ###i)
droplevels() %>%
right_join(date) %>%
group_by(date_vacc) %>%
summarise(nsum = n_distinct(id, na.rm = TRUE)) ###ii)

df$nsum <-
ifelse(is.na(df$nsum),
0,
df$nsum)


Finally, this code is supposed to add together the number of people that have received their final dose over the last two years, and as a rolling sum, approximate the number of fully vaccinated two year olds in the population at any given date
x
. Because it sums over a fixed interval of time I figure it should enter a steady state where the number of people 'exiting' is

lag_vacc <- 2 * 365.25
df$lagsum <- rep(NA, nrow(df))

for (i in (dim(df)[1] - (dim(df)[1] - lag_vacc)):dim(df)[1]) {
df$lagsum[i] <-
sum(df$nsum[(i - lag_vacc):i])
}


However, if I then plot this I get a very strange result that I can't for the life of me explain or correct.

ggplot(df,
aes(x = date_vacc, y = lagsum)) +
geom_point()


It hits the steady state, as predicted. Put then starts increasing again and ends up as 1.3 of the population, i.e. more people vaccinated than exist. This is no longer of any practical importance and even a silly way to represent this data. But I can't figure out where my reasoning is incorrect. Why doesn't this work? Is there a better way to do this?

enter image description here

EDIT: After several days of several hours each I think I finally figured this out. As a recap, the above code calculates a rolling cumulative sum of vaccinated individuals over time based on the date of their 'last' dose of a three dose vaccine. Summing the 'last' dose (representing the second or third dose depending on the situation) is desirable because two doses confer good protection for the first 4-5 years of life even without the third and last dose. Because there is a cut-off point at the end of the x-axis (31-12-2015) the individuals that would otherwise have received their third and 'last' dose after that point, instead are entering the cumulative sum prematurely because their second dose is identified as their 'last'.

Answer

Ok, so you don't expect a stable value, but rather an "oscillation" around some asymptote, right ?
There's one thing that seems a bit odd to me in your code. This line:

for (i in (dim(df)[1] - (dim(df)[1] - lag_vacc)):dim(df)[1])

, if we do the math removing parenthesis seems to end up as:

for (i in (lag_vacc:dim(df)[1])

This doesn't seem correct to me. Shouldn't it be simply:

for (i in ((dim(df)[1] - lag_vacc):dim(df)[1])

Maybe I'm wrong, but that could be the culprit.

Also, you may consider using rollapply instead, to do the cumulative sum over a moving window.