SHW SHW - 11 days ago 4x
R Question

Identify duplicate together with original observation in R (maybe by clustering)

I suspect that some respondents are cheating. I have found duplicate answers, but if I only use the duplicated() and/or unique() functions, I get either the duplicates (without the rows they were copied from) or the unique values (without the duplicates). I want to know which rows are duplicates of which observations. Is there a function in R that makes it easy to find which observations share the same answer pattern?

id <- c("l","l","l","p","p","a","a","a")
show <- c("broadway","cats","alladin","broadway","cats","broadway","cats","alladin")
v1 <- c(1,2,2,1,3,1,2,1)
v2 <- c(1,2,2,2,4,1,2,3)
v3 <- c(1,2,2,5,1,1,2,4)
df <- data.frame(id,show,v1,v2,v3); df

Here's the script I've used to identify the duplicates. I am only interested in duplicates in the numeric part of the data frame, so I select only columns 3 to 5.

# Script I'm using to find duplicates.
# Returns the row numbers of the later copies only; first occurrences are not flagged.
duplicates <- data.frame(which(duplicated(df[,3:5])))
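One common base R idiom for getting the duplicates together with their originals is to run duplicated() in both directions and combine the results. The sketch below rebuilds the example data so it runs on its own; the names answers, in_dup_set and dup_rows are illustrative only.

```r
# Rebuild the example data so the snippet runs on its own
df <- data.frame(
  id   = c("l","l","l","p","p","a","a","a"),
  show = c("broadway","cats","alladin","broadway","cats","broadway","cats","alladin"),
  v1   = c(1,2,2,1,3,1,2,1),
  v2   = c(1,2,2,2,4,1,2,3),
  v3   = c(1,2,2,5,1,1,2,4)
)

# Only the numeric answer columns matter for the comparison
answers <- df[, 3:5]

# duplicated() flags only the later copies; scanning from the end as well
# also flags the first occurrence of each repeated pattern
in_dup_set <- duplicated(answers) | duplicated(answers, fromLast = TRUE)

# All rows that share their answer pattern with at least one other row
dup_rows <- df[in_dup_set, ]
dup_rows
```

This keeps the original row names, so rows 1, 2, 3, 6 and 7 of the example can be traced back to df directly. It does not yet say which rows belong together.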

This question is not a duplicate of "Identify duplicates and mark first occurrence and all others", because I am not interested in a binary output. What would help me most is a way to identify which clusters of duplicates exist. In this case, df[6,] is a duplicate of df[1,] (cluster 1), and df[3,] and df[7,] are duplicates of df[2,] (cluster 2).
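The clusters described above could be listed in base R by building one key per row from the answer columns and splitting the row numbers by that key. This is a minimal sketch, not a definitive method: key, rows_by_pattern and clusters are illustrative names, and the "_" separator assumes the answer values themselves never contain an underscore.

```r
# Rebuild the example data so the snippet runs on its own
df <- data.frame(
  id   = c("l","l","l","p","p","a","a","a"),
  show = c("broadway","cats","alladin","broadway","cats","broadway","cats","alladin"),
  v1   = c(1,2,2,1,3,1,2,1),
  v2   = c(1,2,2,2,4,1,2,3),
  v3   = c(1,2,2,5,1,1,2,4)
)

# One string key per row, built from the answer columns only
# (assumption: "_" never occurs inside an answer value)
key <- do.call(paste, c(df[, 3:5], sep = "_"))

# Row numbers grouped by answer pattern
rows_by_pattern <- split(seq_len(nrow(df)), key)

# Keep only the patterns shared by more than one row
clusters <- rows_by_pattern[lengths(rows_by_pattern) > 1]
clusters
```

Each list element is named after its answer pattern and holds the row numbers that share it, so the first row number in each element is the original and the rest are its copies.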

Wietze's answer, which uses the dplyr package, worked well:

df %>% group_by(v1, v2, v3) %>% filter(n() > 1)

Since I am not very familiar with dplyr's syntax, I have one more question. It looks as if a column (n) is added at the end of the data frame, but if I save the result as an object and ask for its final column, n is not returned. Using this solution, how can I get back to my original data frame with the n column added?


Using the dplyr package:


library(dplyr)

# Filter on the group size; n() is computed on the fly, so no column is added
df %>% group_by(v1, v2, v3) %>% filter(n() > 1)

# Store the group size in a new column n first, then filter on that column
df %>% group_by(v1, v2, v3) %>% mutate(n = n()) %>% filter(n > 1)
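To get back to the full original data frame with n attached, one option is to leave out the filter() step and save the mutated result. The sketch below also adds a cluster label with cur_group_id(), which assumes dplyr 1.0.0 or later; flagged and cluster are illustrative names, not part of the answer above.

```r
library(dplyr)

# Rebuild the example data so the snippet runs on its own
df <- data.frame(
  id   = c("l","l","l","p","p","a","a","a"),
  show = c("broadway","cats","alladin","broadway","cats","broadway","cats","alladin"),
  v1   = c(1,2,2,1,3,1,2,1),
  v2   = c(1,2,2,2,4,1,2,3),
  v3   = c(1,2,2,5,1,1,2,4)
)

flagged <- df %>%
  group_by(v1, v2, v3) %>%
  mutate(
    n       = n(),             # number of rows sharing this answer pattern
    cluster = cur_group_id()   # integer label per answer pattern
  ) %>%
  ungroup()                    # drop the grouping so later steps act on all rows

flagged                   # every original row, plus the n and cluster columns
filter(flagged, n > 1)    # only the rows that belong to a set of duplicates
```

Rows with the same cluster value share an answer pattern, and n tells how many rows are in that cluster, so unique rows simply have n equal to 1.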