r.bot - 1 year ago 76
R Question

# Deduplicate observations based on window of time

I have data on a large number of individuals and there may be multiple observations per person. I want to deduplicate the data into 'episodes' of 28 days for each individual. I want to drop those records where the date of the observation is 28 days or less than the date of the start of the prior episode.

Some sample data on 6 observations of a single individual are below. The

`duplicate`
and
`new_episode`
variables are dummy variables and are not present in the original data and indicate the logic of the example.

``````dat <- data.frame(id = rep(1, 6), spec_n = seq(1,6,1),
spec_date = as.Date(c("2016/01/01", "2016/01/02", "2016/01/30",
"2016/01/31", "2016/02/02", "2016/02/28")),
duplicate = c(0,1,0,1,1,0), new_episode = c(1,0,1,0,0,1),
stringsAsFactors = FALSE)
dat
id spec_n  spec_date duplicate new_episode
1  1      1 2016-01-01         0           1
2  1      2 2016-01-02         1           0
3  1      3 2016-01-30         0           1
4  1      4 2016-01-31         1           0
5  1      5 2016-02-02         1           0
6  1      6 2016-02-28         0           1
``````

With dplyr I can calculate the time since the last observation and the time since the first episode. So deduplicating on
`date_diff`
would not provide the data I require.

``````library(dplyr)
dat <- dat %>% group_by(id) %>%
mutate(date_diff = spec_date - lag(spec_date),
earliest_spec_date = min(spec_date),
diff_earliest = spec_date - earliest_spec_date)
dat
id spec_n  spec_date duplicate new_episode date_diff earliest_spec_date diff_earliest
<dbl>  <dbl>     <date>     <dbl>       <dbl>    <time>             <date>        <time>
1     1      1 2016-01-01         0           1   NA days         2016-01-01        0 days
2     1      2 2016-01-02         1           0    1 days         2016-01-01        1 days
3     1      3 2016-01-30         0           1   28 days         2016-01-01       29 days
4     1      4 2016-01-31         1           0    1 days         2016-01-01       30 days
5     1      5 2016-02-02         1           0    2 days         2016-01-01       32 days
6     1      6 2016-02-28         0           1   26 days         2016-01-01       58 days
``````

However, this does not quite provide what I need.
`spec_n == 6`
is less than 28 days since the previous observation, but more than 28 days since the start of the last episode (
`spec_n == 3`
).

Expected output would be those rows where duplicate is 0 or new_episode is 1, e.g.

``````     id spec_n  spec_date duplicate new_episode date_diff earliest_spec_date diff_earliest
<dbl>  <dbl>     <date>     <dbl>       <dbl>    <time>             <date>        <time>
1     1      1 2016-01-01         0           1   NA days         2016-01-01        0 days
2     1      3 2016-01-30         0           1   28 days         2016-01-01       29 days
3     1      6 2016-02-28         0           1   26 days         2016-01-01       58 days
``````

This should work (its an implementation of the idea Llopis suggested I think).

I make some simulated data first:

``````df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by=1), data=rnorm(365))
date       data
1 2015-01-01 -1.4493544
2 2015-01-02 -0.8860342
3 2015-01-03  1.3629541
4 2015-01-04 -2.0131108
5 2015-01-05 -0.4527413
6 2015-01-06  0.8428585
``````

Now we write a function that takes the first date and checks if subsequent dates are more than 28 days distant from it, returning 0 if they are not and 1 if they are. If a date is 28 days away it takes that new date as the basis of future comparisons.

``````dupFinder <- function(x) {
n <- 1
N <- length(x)
res <- rep(1, N)
start <- x[n]
while (n < (N-1)) {
if (as.numeric(x[n+1]-start)>=28) {
res[n+1] <- 1
n <- n+1
start <- x[n]
}
else {
res[n+1] <- 0
n <- n+1
}
}
return(res)
}
``````

The function dupFinder will return a vector of length equal to that of your dataframe, and you can then use it to subset the dataframe to the rows of interest. Thus:

``````df[dupFinder(df\$date)==1,]
date       data
1   2015-01-01 -1.4493544
29  2015-01-29  0.2084123
57  2015-02-26  1.4541566
85  2015-03-26  0.6794230
113 2015-04-23 -0.8285670
141 2015-05-21 -0.8686872
169 2015-06-18  2.1657994
197 2015-07-16 -1.1802231
225 2015-08-13  0.1808395
253 2015-09-10 -0.4762835
281 2015-10-08 -0.3769593
309 2015-11-05  0.2825544
337 2015-12-03 -0.7132649
365 2015-12-31 -1.8111226
``````

As expected we start with the January 1, then January 29, then Feb 26, since Feb has 28 days we next get March 26th, etc.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download