Kirk Fogg Kirk Fogg - 6 months ago 23
R Question

Filtering dataset on closest date for multiple variables

I'm working with data where each subject has several variables, and a date on which each variable was measured. The variables are measured multiple times for each subject.

I have a separate file with, say, "target dates" for each subject. I want to filter the original dataset and include the only the variable measurements closest to the target date for each subject.

Here's a simple example:

library(dplyr)

set.seed(1234)

data <- data.frame(dates = sample(seq(as.Date('1999/01/01'),
as.Date('2010/01/01'), by="day"), 200),
ids = rep(1:50, 2))
data <- arrange(data, ids)
data$vars <- rep(1:2, 100)

target.date <- data.frame(dates = sample(seq(as.Date('1999/01/01'),
as.Date('2010/01/01'), by="day"), 50),
ids = rep(1:50))


For each subject, there are two variables each measured twice. I want to include only the measurement of each variable closest to the date in
target.date
.

I tried using dplyr like this:

data.sub <- data %>%
group_by(id, vars) %>%


but I'm not sure how to proceed, since I somehow have to reference the date in
target.date
associated with each of the
ids
. I'd also like to choose the closest date without going over the target date, but may not be possible for some subjects in this example I created.

Here is a snippet of the output I'm trying to get at:

dates ids vars
1 2000-04-01 1 1
2 2003-08-26 1 2
3 2005-01-22 2 1
4 2002-05-13 2 2
.
.
.


For subject 2, both dates for
vars==1
are after the target date. But I included it in anyway for simplicity. But if possible, I'd like to take the dates closest to the target date without going over it.

Answer Source

Here is what I think you need. It is not very clear from your question and you did not provide expected output given the sample data. I can modify based on your feedback.

data <- inner_join(data, target.date, by = 'ids')
data$days <- difftime(data$dates.x, data$dates.y, units = 'days')
data$days <- abs(as.integer(data$days))
data %>%
  group_by(ids, vars) %>%
  filter(days == min(days)) %>%
  slice(1)