Pelle Pelle - 1 month ago 10
R Question

R: Compare two data.frames and delete data that not match

I have two data.frame time series. The first is complete, the second is containing only the time steps with correct values of one of the variables. I need to keep all other variables but want to NA all the variables which appear in the second df. Here the example:

library(dplyr)
library(tidyverse)
library(lubridate)

#test data:
TDF <- tibble(DATE = seq( make_datetime(2007,09,23,06,00), make_datetime(2008,07,05,23,00), by = 600),
V1 = round(runif(length(DATE)),2),
V2 = round(runif(length(DATE)),2),
V3 = round(runif(length(DATE)),2))
TDF2 <- TDF
TDF2 <- TDF2[TDF2$V1>0.7,]


Output:

> TDF
# A tibble: 41,287 × 4
DATE V1 V2 V3
<dttm> <dbl> <dbl> <dbl>
1 2007-09-23 06:00:00 0.89 0.21 0.03
2 2007-09-23 06:10:00 0.26 0.54 0.70
3 2007-09-23 06:20:00 0.74 0.22 0.80
4 2007-09-23 06:30:00 0.31 0.48 0.38
5 2007-09-23 06:40:00 0.93 0.26 0.21

> TDF2
# A tibble: 11,972 × 4
DATE V1 V2 V3
<dttm> <dbl> <dbl> <dbl>
1 2007-09-23 06:00:00 0.89 0.21 0.03
2 2007-09-23 06:20:00 0.74 0.22 0.80
3 2007-09-23 06:40:00 0.93 0.26 0.21
4 2007-09-23 07:20:00 0.91 0.36 0.83
5 2007-09-23 07:40:00 0.95 0.87 0.91


And this is what I need:

> TDF_modified
# A tibble: 41,287 × 4
DATE V1 V2 V3
<dttm> <dbl> <dbl> <dbl>
1 2007-09-23 06:00:00 0.89 0.21 0.03
2 2007-09-23 06:10:00 NA 0.54 0.70
3 2007-09-23 06:20:00 0.74 0.22 0.80
4 2007-09-23 06:30:00 NA 0.48 0.38
5 2007-09-23 06:40:00 0.93 0.26 0.21


I need a general solution just by using the two data.frames. I could do it via loop but I bet there is a more elegant way. Thanks a lot in advanced!

Answer

It is not entirely clear to me what you are looking for, but based on your example output, it appears that you want to keep the results for V2 and V3 from TDF, and only remove the V1 reading.

One approach would be to use left_join without the V1 column from TDF and with only the V1 column from TDF2:

left_join(
  select(TDF, -V1)
  , select(TDF2, DATE, V1) )

Gives:

                  DATE    V2    V3    V1
                <dttm> <dbl> <dbl> <dbl>
1  2007-09-23 06:00:00  0.14  0.62    NA
2  2007-09-23 06:10:00  0.87  0.05  0.87
3  2007-09-23 06:20:00  0.20  0.52    NA
4  2007-09-23 06:30:00  0.34  0.01    NA
5  2007-09-23 06:40:00  0.92  0.37  0.83
6  2007-09-23 06:50:00  0.94  0.27    NA
7  2007-09-23 07:00:00  0.98  0.49    NA
8  2007-09-23 07:10:00  0.70  0.98    NA
9  2007-09-23 07:20:00  0.05  0.55  0.72
10 2007-09-23 07:30:00  0.16  0.12  0.99

from my randomized data.

If you are looking for a more complete solution, you could do a full_join, which will yield duplicated columns, and then handle them as you wish to. E.g., this returns NA for V1 if TDF2$V1 is missing, but for V2 and V3 it gives the TDF value if missing, and only gives NA if the two values are present and different. Since I am not sure what you want to use, you may need to add more complicated checks here.

full_join(TDF, TDF2, "DATE") %>%
  mutate(V1 = ifelse(is.na(V1.y), NA, V1.x)
         , V2 = ifelse(is.na(V2.y), V2.x
                       , ifelse(V2.x == V2.y, V2.x, NA))
         , V3 = ifelse(is.na(V3.y), V3.x
                       , ifelse(V3.x == V3.y, V3.x, NA))
         )

Returns:

# A tibble: 41,287 × 10
                  DATE  V1.x  V2.x  V3.x  V1.y  V2.y  V3.y    V1    V2    V3
                <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  2007-09-23 06:00:00  0.62  0.14  0.62    NA    NA    NA    NA  0.14  0.62
2  2007-09-23 06:10:00  0.87  0.87  0.05  0.87  0.87  0.05  0.87  0.87  0.05
3  2007-09-23 06:20:00  0.53  0.20  0.52    NA    NA    NA    NA  0.20  0.52
4  2007-09-23 06:30:00  0.03  0.34  0.01    NA    NA    NA    NA  0.34  0.01
5  2007-09-23 06:40:00  0.83  0.92  0.37  0.83  0.92  0.37  0.83  0.92  0.37
6  2007-09-23 06:50:00  0.70  0.94  0.27    NA    NA    NA    NA  0.94  0.27
7  2007-09-23 07:00:00  0.51  0.98  0.49    NA    NA    NA    NA  0.98  0.49
8  2007-09-23 07:10:00  0.65  0.70  0.98    NA    NA    NA    NA  0.70  0.98
9  2007-09-23 07:20:00  0.72  0.05  0.55  0.72  0.05  0.55  0.72  0.05  0.55
10 2007-09-23 07:30:00  0.99  0.16  0.12  0.99  0.16  0.12  0.99  0.16  0.12
# ... with 41,277 more rows

(And, to clean up the duplicated columns, simply use select once you are done with them.)