kradja - 7 months ago
R Question

R deleting duplicates in other columns

Hey guys, I definitely solved this problem before, but I lost my code.
Here is a simplified version of what I have.

a1 <- c(1,2,4,3,5)
a2 <- c("a","b","b","c","f")
a3 <- c(3,4,"b",1,9)
a4 <- c("c","b",2,"a","d")
a <- cbind(a1,a2,a3,a4)

a1 and a2 form one set, and a3 and a4 form another.

I would like to remove rows that contain the same sets as an earlier row, regardless of order. Here that means removing rows 3 and 4, which repeat rows 2 and 1 respectively. The data comes from a BLAST run showing links between genomes and is 34,000 rows long, so an efficient solution would be great.

Thank you so much! I would also be open to doing this in another language.


We can sort each row of 'a', get a logical index of the rows that are not (!) duplicated, and use that index to filter the rows.

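# Sort within each row so the same values in any order compare equal
i1 <- !duplicated(t(apply(a, 1, sort)))
# Keep the first occurrence of each row (note: this overwrites the vector a1)
a1 <- a[i1, ]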

The indices of the rows that remain in the dataset are

which(i1)
#[1] 1 2 5
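
One caveat with sorting whole rows: it treats all four values as interchangeable, so a row like (1, c, 3, a) would be flagged as a duplicate of (1, a, 3, c) even though the sets differ. If the sets need to stay intact, a possible variant is to build an order-insensitive key per set and then per row. This is only a sketch under assumptions: `pair_key` is a hypothetical helper (not a base R function), the column names are added here for readability, and the "|" separator assumes that character never appears in the data.

```r
# Rebuild the example data from the question (cbind coerces everything to character).
a <- cbind(a1 = c(1, 2, 4, 3, 5),
           a2 = c("a", "b", "b", "c", "f"),
           a3 = c(3, 4, "b", 1, 9),
           a4 = c("c", "b", 2, "a", "d"))

# Hypothetical helper: an order-insensitive key for two character vectors.
# Assumes "|" does not occur in the values themselves.
pair_key <- function(x, y) paste(pmin(x, y), pmax(x, y), sep = "|")

k1 <- pair_key(a[, "a1"], a[, "a2"])  # key for the (a1, a2) set
k2 <- pair_key(a[, "a3"], a[, "a4"])  # key for the (a3, a4) set
row_key <- pair_key(k1, k2)           # ignore which side each set is on

keep <- !duplicated(row_key)          # TRUE for the first occurrence of each key
deduped <- a[keep, , drop = FALSE]
which(keep)
#[1] 1 2 5
```

Because `pmin`, `pmax`, `paste`, and `duplicated` are all vectorized, this avoids the per-row `apply` call, which may help on the full 34,000-row table, though the row-wise sort should also be fast enough at that size.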