user6571411 - 1 year ago 52

R Question

I have a dataframe

`df`

`x`

After considerable data wrangling, we start the example data with the dataframes `df` and `date`:

```
glimpse(df)
Observations: 50,469
Variables: 6
$ id        <chr> "1000038", "1000038", "1000038", "1000128", "1000380",...
$ n_max     <int> 3, 1, 1, 3, 3, 3, 3,...                 ### total num times before 2 years old
$ age_y     <int> 0, 0, 0, 0, 1, 0, 0,...                 ### current age for this observation
$ age_m     <int> 3, 5, 11, 3, 4,...                      ### current age in months for this obs
$ date_vacc <date> 2013-05-08, 2013-07-03, 2014-01-13,... ### current date obs
$ year      <dbl> 2013, 2013, 2014, 2013,...              ### current year of obs

glimpse(date)
Observations: 4,017
Variables: 1
$ date_vacc <date> 2005-01-01, 2005-01-02, 2005-01-03, 2005-01-04, 2005-01-05, 2005-01-06, 2005-01-07, 2005-01-08, 2005-01-09, 20...
```

Now I exploit the structure of

`df`

`i)`

`ii)`

```
df <-
  df[!duplicated(df$id, fromLast = TRUE), ] %>%   ### i)
  droplevels() %>%
  right_join(date) %>%
  group_by(date_vacc) %>%
  summarise(nsum = n_distinct(id, na.rm = TRUE))  ### ii)

df$nsum <-
  ifelse(is.na(df$nsum),
         0,
         df$nsum)
```
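To make the two steps concrete, here is a minimal base-R sketch on made-up data. The `doses` and `calendar` tables and their values are hypothetical, not the real data, and the sketch assumes the rows are already ordered by date within each `id`:

```r
# Hypothetical toy data: three ids with 3, 2 and 1 doses respectively.
doses <- data.frame(
  id = c("a", "a", "a", "b", "b", "c"),
  date_vacc = as.Date(c("2013-01-10", "2013-03-10", "2014-01-10",
                        "2013-02-01", "2013-04-01", "2013-05-05")),
  stringsAsFactors = FALSE
)

# i) duplicated(..., fromLast = TRUE) flags every row that is followed by a
#    later row with the same id, so negating it keeps the last row per id.
last_dose <- doses[!duplicated(doses$id, fromLast = TRUE), ]

# ii) Count last doses per calendar day. After step i) there is one row per
#     id, so counting rows per date equals counting distinct ids per date.
#     Days without any last dose come out as 0 rather than NA.
calendar <- data.frame(
  date_vacc = seq(as.Date("2013-01-01"), as.Date("2014-01-31"), by = "day")
)
calendar$nsum <- tabulate(
  match(last_dose$date_vacc, calendar$date_vacc),
  nbins = nrow(calendar)
)
```

Each id contributes exactly once, on the date of its last recorded dose, so `sum(calendar$nsum)` equals the number of distinct ids.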

Finally, this code is

`x`

```
lag_vacc <- 2 * 365.25

df$lagsum <- rep(NA, nrow(df))

for (i in (dim(df)[1] - (dim(df)[1] - lag_vacc)):dim(df)[1]) {
  df$lagsum[i] <-
    sum(df$nsum[(i - lag_vacc):i])
}
```

However, if I then plot this I get a very strange result that I can't for the life of me explain or correct.

```
ggplot(df,
       aes(x = date_vacc, y = lagsum)) +
  geom_point()
```

It hits the steady state, as predicted. But then it starts increasing again and ends up at 1.3 times the population, i.e. more people vaccinated than exist. This is no longer of any practical importance, and it may even be a silly way to represent this data, but I can't figure out where my reasoning is incorrect. Why doesn't this work? Is there a better way to do this?

EDIT: After several days, at several hours each, I think I finally figured this out. As a recap, the above code calculates a rolling cumulative sum of vaccinated individuals over time, based on the date of their 'last' dose of a three-dose vaccine. Summing the 'last' dose (representing the second or third dose, depending on the situation) is desirable because two doses confer good protection for the first 4-5 years of life even without the third and last dose. Because there is a cut-off point at the end of the x-axis (31-12-2015), the individuals who would otherwise have received their third and 'last' dose after that point instead enter the cumulative sum prematurely, because their second dose is identified as their 'last'.
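The mechanism can be reproduced on synthetic data. This is only a sketch under invented assumptions: one child starts the series per day, dose 2 comes 60 days after dose 1, and dose 3 comes 300 days after dose 1. None of these numbers come from the real data.

```r
# Hypothetical cohort: one child per day, with fixed dose intervals.
cutoff <- as.Date("2015-12-31")
dose1 <- seq(as.Date("2014-01-01"), as.Date("2015-06-30"), by = "day")
dose2 <- dose1 + 60
dose3 <- dose1 + 300

# With complete follow-up, every child's last dose is dose 3.
# With the data cut off, children whose dose 3 falls after the cut-off
# appear to have dose 2 as their 'last' dose.
late <- dose3 > cutoff
observed_last <- dose3
observed_last[late] <- dose2[late]

# Number of children counted early, and by how many days.
n_early <- sum(late)
days_early <- as.numeric(dose3[late] - observed_last[late])
```

Here every child in `late` is counted 240 days too early, so some calendar days receive two 'last' doses (one genuine third dose and one premature second dose), which is the kind of surplus that inflates the rolling sum towards the end of the series.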

Answer Source

OK, so you don't expect a stable value, but rather an "oscillation" around some asymptote, right?

There's one thing that seems a bit odd to me in your code. This line:

```
for (i in (dim(df)[1] - (dim(df)[1] - lag_vacc)):dim(df)[1])
```

if we do the math and remove the parentheses, seems to end up as:

```
for (i in lag_vacc:dim(df)[1])
```
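That simplification can be checked numerically. The sketch below assumes `n` is 4,017, the number of rows shown for `date` in the question. It also shows that `lag_vacc` is not a whole number, so the loop index takes fractional values, which R truncates when they are used for subsetting:

```r
n <- 4017                 # assumed row count, taken from glimpse(date)
lag_vacc <- 2 * 365.25    # 730.5, as in the question

start_original <- n - (n - lag_vacc)  # start of the loop as written
start_simplified <- lag_vacc          # start after removing the parentheses

# The loop index runs over 730.5, 731.5, ..., 4016.5.
idx <- start_simplified:n

# Fractional indices are truncated towards zero when subsetting.
x <- 1:10
x_at_fraction <- x[2.5]   # same element as x[2]
```

Both start values are 730.5, so the two loop headers iterate over the same sequence.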

This doesn't seem correct to me. Shouldn't it be simply:

```
for (i in (dim(df)[1] - lag_vacc):dim(df)[1])
```

Maybe I'm wrong, but that could be the culprit.

Also, you may consider using `rollapply` (from the `zoo` package) instead, to do the cumulative sum over a moving window.
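As a rough illustration of the moving-window idea, here is a base-R sketch of a trailing-window sum that needs no extra packages. `roll_sum` is a hypothetical helper name, and the input vector is made up. For the two-year lag, a whole-number width such as 731 would be needed, which is an assumption about what the question intends:

```r
# Trailing-window sum using cumulative sums; positions with fewer than
# `width` observations behind them are left as NA.
roll_sum <- function(x, width) {
  n <- length(x)
  cs <- c(0, cumsum(x))
  out <- rep(NA_real_, n)
  if (n >= width) {
    idx <- width:n
    out[idx] <- cs[idx + 1] - cs[idx - width + 1]
  }
  out
}

nsum <- c(1, 0, 2, 3, 0, 1)
lagsum <- roll_sum(nsum, width = 3)
# lagsum is NA NA 3 5 5 4

# With zoo installed, the equivalent call should be:
# zoo::rollapply(nsum, width = 3, FUN = sum, align = "right", fill = NA)
```

Both approaches avoid the explicit `for` loop and the manual index arithmetic.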