kradja - 9 months ago 48

R Question

Hey guys I definitely solved this problem before but I lost my code...

Here is a simplification of what I have.

```
a1 <- c(1,2,4,3,5)
a2 <- c("a","b","b","c","f")
a3 <- c(3,4,"b",1,9)
a4 <- c("c","b",2,"a","d")
a <- cbind(a1,a2,a3,a4)
```

a1 and a2 form one pair, as do a3 and a4.

I would like to remove the duplicates, i.e. rows 3 and 4, which contain the same values as rows 2 and 1 in a different order. This data comes from a BLAST run showing links between genomes and it is 34,000 rows long, so an efficient solution would be great.

Thank you so much! I would also be open to doing this in another language.

Answer

We can `sort` the 'a' by row, get the logical index of not (`!`) `duplicated` elements and use that to filter the rows.

```
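# Sort each row so that order within a row no longer matters, then flag first occurrences
i1 <- !duplicated(t(apply(a, 1, sort)))
# Keep only those rows (note: this overwrites the vector a1 from the question)
a1 <- a[i1,]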
```
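To see why the transpose is needed, it can help to run the steps one at a time. The sketch below rebuilds the question's matrix and uses illustrative intermediate names (`sorted_cols`, `sorted_rows`, `keep`) that are not part of the original answer. It assumes there are no missing values, since `sort` drops `NA` by default and the rows would then no longer have equal length.

```r
# Rebuild the example matrix from the question.
# cbind() coerces everything to character because the columns mix numbers and letters.
a <- cbind(a1 = c(1, 2, 4, 3, 5),
           a2 = c("a", "b", "b", "c", "f"),
           a3 = c(3, 4, "b", 1, 9),
           a4 = c("c", "b", 2, "a", "d"))

# apply() over rows returns each sorted row as a COLUMN of the result (4 x 5 here).
sorted_cols <- apply(a, 1, sort)

# Transpose so that each sorted row is a row again (5 x 4).
sorted_rows <- t(sorted_cols)

# duplicated() on a matrix compares whole rows; negate it to keep first occurrences.
keep <- !duplicated(sorted_rows)

dim(sorted_cols)
which(keep)
```

Without the `t()`, `duplicated` would compare the four sorted positions instead of the five original rows.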

The indices of the rows that remain in the dataset are

```
which(i1)
#[1] 1 2 5
```
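Since the question asks which rows to remove, the complement of the same index gives them directly. Below is a minimal sketch that wraps the idea in a helper; the name `drop_unordered_duplicates` is hypothetical, and `drop = FALSE` is an addition so the result stays a matrix even if only one row survives. It assumes a matrix with at least two columns and no missing values.

```r
# Hypothetical helper: treats rows as duplicates when they hold the same
# values in any order, and reports both the kept rows and the dropped indices.
drop_unordered_duplicates <- function(m) {
  keep <- !duplicated(t(apply(m, 1, sort)))
  list(unique_rows = m[keep, , drop = FALSE],  # stays a matrix
       removed     = which(!keep))             # row numbers that were dropped
}

# Example matrix from the question.
a <- cbind(a1 = c(1, 2, 4, 3, 5),
           a2 = c("a", "b", "b", "c", "f"),
           a3 = c(3, 4, "b", 1, 9),
           a4 = c("c", "b", 2, "a", "d"))

res <- drop_unordered_duplicates(a)
res$removed
#[1] 3 4
nrow(res$unique_rows)
#[1] 3
```

Returning the removed indices alongside the filtered matrix makes it easy to spot-check a few dropped rows before trusting the result on the full 34,000-row table.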

Source (Stackoverflow)