stasSajin - 1 year ago 63
R Question

# Expand by ID for future periods only

Is there a way to fill in for implicit missingness for future dates based on id?

For example, imagine a experiment that starts in Jan-2016. I have 3 participants that join in at different periods. Subject 1 joins me in Jan and continues to stay until Aug. Subj 2 joins me in March, and stays in the experiment until August. Subject 3 also joins me in March, but drops out sometime in in May, so no observations are recorded for periods May-Aug.

The question is, how do I fill in the dates when subject 3 dropped out of the experiment? Here is some mock data for how things look like:

``````   Subject   Date
1        1 Jan-16
2        1 Feb-16
3        1 Mar-16
4        1 Apr-16
5        1 May-16
6        1 Jun-16
7        1 Jul-16
8        1 Aug-16
9        2 Mar-16
10       2 Apr-16
11       2 May-16
12       2 Jun-16
13       2 Jul-16
14       2 Aug-16
15       3 Mar-16
16       3 Apr-16

structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L), Date = structure(c(5L, 4L, 8L, 2L,
9L, 7L, 6L, 3L, 8L, 2L, 9L, 7L, 6L, 3L, 8L, 2L), .Label = c("",
"Apr-16", "Aug-16", "Feb-16", "Jan-16", "Jul-16", "Jun-16", "Mar-16",
"May-16"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L), .Names = c("Subject", "Date"))
``````

And here is how the data should look like:

``````   Subject   Date
1        1 Jan-16
2        1 Feb-16
3        1 Mar-16
4        1 Apr-16
5        1 May-16
6        1 Jun-16
7        1 Jul-16
8        1 Aug-16
9        2 Mar-16
10       2 Apr-16
11       2 May-16
12       2 Jun-16
13       2 Jul-16
14       2 Aug-16
15       3 Mar-16
16       3 Apr-16
17       3 May-16
18       3 Jun-16
19       3 Jul-16
20       3 Aug-16

structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Date = structure(c(4L,
3L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L,
6L, 5L, 2L), .Label = c("Apr-16", "Aug-16", "Feb-16", "Jan-16",
"Jul-16", "Jun-16", "Mar-16", "May-16"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L), .Names = c("Subject", "Date"))
``````

I tried using
`expand`
from
`tidyr`
and
`TimeFill`
from
`DataCombine`
package, but the issue with those approaches is that I would get dates for periods before a participant joined an experiment. In this particular instance, I only want the periods to be filled for cases when a participant drops out of an experiment.

The `complete` function from `tidyr` is designed for turning implicit missing values into explicit missing values. We will have to do some filtering to not include past completion. The easiest way seems to be to do a join on a table with starting values:

``````library(dplyr)
library(tidyr)

df <- df %>%
filter(Date != '') %>%
droplevels() %>%
group_by(Subject)

df2 <- summarise(df, start = first(Date))

df %>%
complete(Subject, Date) %>%
left_join(df2) %>%
mutate(Date2 = as.Date(paste0('01-', Date), format = '%d-%b-%y'),
start = as.Date(paste0('01-', start), format = '%d-%b-%y')) %>%
filter(Date2 >= start) %>%
arrange(Subject, Date2) %>%
select(-start, -Date2)
``````

Result:

``````Source: local data frame [20 x 2]
Groups: Subject [3]

Subject   Date
<int> <fctr>
1        1 Jan-16
2        1 Feb-16
3        1 Mar-16
4        1 Apr-16
5        1 May-16
6        1 Jun-16
7        1 Jul-16
8        1 Aug-16
9        2 Mar-16
10       2 Apr-16
11       2 May-16
12       2 Jun-16
13       2 Jul-16
14       2 Aug-16
15       3 Mar-16
16       3 Apr-16
17       3 May-16
18       3 Jun-16
19       3 Jul-16
20       3 Aug-16
``````

I use date conversion to do a reliable comparison with the starting date, but you might be able to use the `row_number`s somehow. A problem is that `complete` will rearrange the data.

p.s. Note that your example `dput` has an empty factor level (`""`), so I filter that out first.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download