Visser Visser - 2 months ago 12
R Question

Removal of duplicates from specific columns/rows only

I have a dataset with a lot of duplicate TIME and AIR data entries. I need to remove all of the rows that contain these entries. This in itself would be extremely simple:

DATE <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3)
#Supposed to be 4 entries per date; day 1 has 1 entry too many, day 2 has 2 entries too many
TIME <- c(0,3,3,6,9,0,0,3,6,9,9,0,3,6,9)
#Samples are taken every 3 hours from 0-9 each day
AIR <- c(1.1, 1.2, 1.2, 1.3, 1.4, 1.1, 1.1, 1.2, 1.3, 1.4, 1.4, 1.1, 1.2, 1.3, 1.4)

DF <- data.frame(DATE, TIME, AIR)

DF1 <- DF[!(duplicated(DF$TIME) & duplicated(DF$AIR)),]


DF1
would output the following:

DATE TIME AIR
1 1 0 1.1
2 1 3 1.2
4 1 6 1.3
5 1 9 1.4


However, I need to consider each day, or DATE, individually, the output I am actually looking for is:

DATE TIME AIR
1 0 1.1
1 3 1.2
1 6 1.3
1 9 1.4
2 0 1.1
2 3 1.2
2 6 1.3
2 9 1.4
3 0 1.1
3 3 1.2
3 6 1.3
3 9 1.4


In this case the duplication stops for each given day. I thought it would be as straightforward as slightly adapting my piece of code, it transpires that it is not, I have tried many variations of:
DF2 <- DF[!(duplicated(DF$TIME) & duplicated(DF$AIR)) & duplicated(DF$DATE),]
to no avail. Therefore it probably requires something slightly more complicated - or at least beyond my current means. Could anyone advise on what I would need to do in order to consider each day seperately in this way? Thank you so much!

Note that in reality it may be that
TIME
is duplicated but
AIR
is not; this is why I have to remove only those entries where
TIME
and
AIR
are duplicates.

Answer

You need to group, i.e. (with dplyr)

library(dplyr)
DF %>% 
  group_by(DATE) %>% 
  filter(!(duplicated(TIME) & duplicated(AIR))