stasSajin stasSajin - 2 months ago 7
R Question

Expand by ID for future periods only

Is there a way to fill in for implicit missingness for future dates based on id?

For example, imagine a experiment that starts in Jan-2016. I have 3 participants that join in at different periods. Subject 1 joins me in Jan and continues to stay until Aug. Subj 2 joins me in March, and stays in the experiment until August. Subject 3 also joins me in March, but drops out sometime in in May, so no observations are recorded for periods May-Aug.

The question is, how do I fill in the dates when subject 3 dropped out of the experiment? Here is some mock data for how things look like:

Subject Date
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16




structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L), Date = structure(c(5L, 4L, 8L, 2L,
9L, 7L, 6L, 3L, 8L, 2L, 9L, 7L, 6L, 3L, 8L, 2L), .Label = c("",
"Apr-16", "Aug-16", "Feb-16", "Jan-16", "Jul-16", "Jun-16", "Mar-16",
"May-16"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L), .Names = c("Subject", "Date"))


And here is how the data should look like:

Subject Date
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16
17 3 May-16
18 3 Jun-16
19 3 Jul-16
20 3 Aug-16

structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Date = structure(c(4L,
3L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L,
6L, 5L, 2L), .Label = c("Apr-16", "Aug-16", "Feb-16", "Jan-16",
"Jul-16", "Jun-16", "Mar-16", "May-16"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L), .Names = c("Subject", "Date"))


I tried using
expand
from
tidyr
and
TimeFill
from
DataCombine
package, but the issue with those approaches is that I would get dates for periods before a participant joined an experiment. In this particular instance, I only want the periods to be filled for cases when a participant drops out of an experiment.

Answer

The complete function from tidyr is designed for turning implicit missing values into explicit missing values. We will have to do some filtering to not include past completion. The easiest way seems to be to do a join on a table with starting values:

library(dplyr)
library(tidyr)

df <- df %>% 
  filter(Date != '') %>% 
  droplevels() %>% 
  group_by(Subject)

df2 <- summarise(df, start = first(Date))

df %>%
  complete(Subject, Date) %>%
  left_join(df2) %>% 
  mutate(Date2 = as.Date(paste0('01-', Date), format = '%d-%b-%y'),
         start = as.Date(paste0('01-', start), format = '%d-%b-%y')) %>% 
  filter(Date2 >= start) %>%
  arrange(Subject, Date2) %>% 
  select(-start, -Date2)

Result:

Source: local data frame [20 x 2]
Groups: Subject [3]

   Subject   Date
     <int> <fctr>
1        1 Jan-16
2        1 Feb-16
3        1 Mar-16
4        1 Apr-16
5        1 May-16
6        1 Jun-16
7        1 Jul-16
8        1 Aug-16
9        2 Mar-16
10       2 Apr-16
11       2 May-16
12       2 Jun-16
13       2 Jul-16
14       2 Aug-16
15       3 Mar-16
16       3 Apr-16
17       3 May-16
18       3 Jun-16
19       3 Jul-16
20       3 Aug-16

I use date conversion to do a reliable comparison with the starting date, but you might be able to use the row_numbers somehow. A problem is that complete will rearrange the data.

p.s. Note that your example dput has an empty factor level (""), so I filter that out first.