Sebastian Zeki Sebastian Zeki - 1 month ago 9
R Question

Merge based on similar but not exact dates

I have two dataframes. I would like to merge them only where the id is the same and the VisitDate is the same by no more that three days. Merging is simple but how can I specify the date range rather than exact date merge?

Here is a sample:

df1:

structure(list(V1 = structure(c(5L, 1L, 1L, 2L, 3L, 3L, 3L, 4L
), .Label = c("1", "2", "3", "4", "Id"), class = "factor"), V2 = structure(c(7L,
5L, 6L, 5L, 3L, 4L, 2L, 1L), .Label = c("2012-01-02", "2012-02-03",
"2012-02-14", "2012-03-06", "2012-05-23", "2014-07-13", "VisitDate"
), class = "factor"), V3 = structure(c(8L, 2L, 4L, 5L, 1L, 6L,
7L, 3L), .Label = c("12", "2", "22", "23", "33", "43", "54",
"Another column"), class = "factor")), .Names = c("V1", "V2",
"V3"), class = "data.frame", row.names = c(NA, -8L))


df2:

structure(list(V1 = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("1",
"2", "3", "4", "Id"), class = "factor"), V2 = structure(c(5L,
4L, 3L, 2L, 1L), .Label = c("2012-01-08", "2012-02-16", "2012-05-25",
"2012-07-15", "VisitDate"), class = "factor"), V3 = structure(c(5L,
3L, 1L, 4L, 2L), .Label = c("22", "33", "43", "64", "Another column"
), class = "factor")), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-5L))


The output should be:

Id VisitDate.df1 col.2f1 VisitDate.df2 col.df2
1 2014-07-13 23 2012-07-15 43
2 2012-05-23 33 2012-05-25 22
3 2012-02-14 12 2012-02-16 64

Answer

If your data is not too large, you can simple join on id and then filter down to the rows that are off by no more than 3 days.

For example, under the tidyverse framework:

library(tidyverse)

df1 = structure(list(Id = c(1, 1, 2, 3, 3, 3, 4), 
                     VisitDate = structure(c(15483, 16264, 15483, 15384, 15405, 15373, 15341), class = "Date"),
                     Column = c(2, 4, 5, 1, 6, 7, 3)), 
                .Names = c("Id", "VisitDate", "Column"), 
                row.names = 1:7, 
                class = "data.frame")

df2 = structure(list(Id = c(1, 2, 3, 4), 
                     VisitDate = structure(c(15536, 15485, 15386, 15347), class = "Date"), 
                     Column = c(3, 1, 4, 2)), 
                .Names = c("Id", "VisitDate", "Column"), 
                row.names = 1:4, 
                class = "data.frame")


df1 %>%
    left_join(df2, by = "Id", suffix = c(".df1", ".df2")) %>%
    filter(abs(VisitDate.df1 - VisitDate.df2) <= 3)
#>   Id VisitDate.df1 Column.df1 VisitDate.df2 Column.df2
#> 1  2    2012-05-23          5    2012-05-25          1
#> 2  3    2012-02-14          1    2012-02-16          4

An alternative approach is to replicate rows of your data, perhaps in df1. If you have the same ID occurring across multiple dates, this might be more efficient.

df1 %>%
    mutate(date = map(VisitDate, function(x){seq(x - 3, x + 3, by = 1)})) %>%
    unnest(date) %>%
    inner_join(df2, by = c("Id", "date" = "VisitDate"), suffix = c(".df1", ".df2"))
#>   Id  VisitDate Column.df1       date Column.df2
#> 1  2 2012-05-23          5 2012-05-25          1
#> 2  3 2012-02-14          1 2012-02-16          4