afleishman - 1 month ago
R Question

Filter duplicates from a data frame in R

I have a data frame with one observation per row and two observations per subject. I'd like to keep only the rows whose 'day' value is duplicated within a subject.

ex <- data.frame('id'= rep(1:5,2), 'day'= c(1:5, 1:3,5:6))


The following code returns only the second row of each duplicated pair, not the first. I'd like to keep both rows of each pair.

library(dplyr)

ex %>%
  group_by(id) %>%
  filter(duplicated(day))


The following code works, but seems clunky. Does anyone have a more efficient solution?

ex %>%
  group_by(id) %>%
  filter(duplicated(day, fromLast = TRUE) | duplicated(day, fromLast = FALSE))

Answer

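duplicated() can be applied to the whole data frame, so this can be done with base R alone. Because ex contains only id and day, comparing whole rows is equivalent to checking day within each id. The second call, with fromLast = TRUE, flags the first occurrence as well.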

ex[duplicated(ex) | duplicated(ex, fromLast = TRUE), ]
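
To see why both calls are needed, here is a minimal sketch on a small made-up vector (x is only for illustration), followed by the same idea on the example data. The names first_pass, last_pass, either and res_base are just labels chosen here.

```r
# A small illustrative vector: the value 1 appears twice.
x <- c(1, 2, 1, 3)

first_pass <- duplicated(x)                   # flags every occurrence after the first
last_pass  <- duplicated(x, fromLast = TRUE)  # flags every occurrence before the last
either     <- first_pass | last_pass          # flags every member of a duplicated set

print(rbind(first_pass, last_pass, either))

# The same idea on the example data frame, where whole rows are compared.
ex <- data.frame(id = rep(1:5, 2), day = c(1:5, 1:3, 5:6))
res_base <- ex[duplicated(ex) | duplicated(ex, fromLast = TRUE), ]
print(res_base)
```

Combining the two passes with | is what keeps both rows of each pair rather than only the later one.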

Using dplyr, we can group_by() both columns and keep only the groups that contain more than one row (n() > 1).

ex %>%
  group_by(id, day) %>%
  filter(n() > 1)
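
As a quick sanity check, the two approaches can be compared on the example data. This is only a sketch and assumes dplyr is installed. The ungroup() call and the names res_base and res_dplyr are additions made here for the comparison, not part of the answer itself.

```r
library(dplyr)

ex <- data.frame(id = rep(1:5, 2), day = c(1:5, 1:3, 5:6))

# Base R: flag rows that are duplicated when scanning in either direction.
res_base <- ex[duplicated(ex) | duplicated(ex, fromLast = TRUE), ]

# dplyr: keep the (id, day) groups that contain more than one row.
res_dplyr <- ex %>%
  group_by(id, day) %>%
  filter(n() > 1) %>%
  ungroup()

# Both results should hold the same rows in the same order;
# stopifnot() raises an error if they differ.
stopifnot(
  identical(res_dplyr$id, res_base$id),
  identical(res_dplyr$day, res_base$day)
)
```

The base R result keeps the original row names, while the dplyr result is a tibble with fresh row numbering, so the check compares the columns rather than the whole objects.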