raistlin raistlin - 2 months ago 12
R Question

Removing duplicated rows in df based on column of matches in R

My

df
has a
link
column (of type list) that specifies if a record has a match within
df
(ie. if it has duplicates).

df <- data.frame(id=1:7,link=I(list(c(2,3),c(1,3),c(1,2),NA,NA,7,6)))

id link
1 1 2, 3
2 2 1, 3
3 3 1, 2
4 4 NA
5 5 NA
6 6 7
7 7 6


I would like to subset
df
to keep only the first matched row (ie. with respect to order by id) for those rows with linked records. I want:

id link
1 1 2, 3
2 4 NA
3 5 NA
4 6 7


I have tried a loop to store the
id
values of the rows to be removed from
df
in
to_remove
. It doesn't quite work at the moment and I feel I am overthinking this.

to_remove <- character(0)
for (n in 1:nrow(df)) {
links <- df$link[[n]]
if (all(is.na(links))) next # skip if no links available
add <- ifelse(links %in% to_remove, NA,links)
add <- add[!is.na(add)]
if (length(add > 0)) to_remove <- c(to_remove,add)
}


Can I do this in a simpler way and avoid loops?

Answer

Using:

library(data.table)
DT <- data.table(id = rep(df$id, lengths(df$link)), link = unlist(df$link))
DT[DT[, .I[!any(id > link) | is.na(link)], by = id]$V1][, .(link = toString(link)), by = id]

gives:

   id link
1:  1 2, 3
2:  4   NA
3:  5   NA
4:  6    7

Explanation:

  • First create a new data.frame/data.table by unlisting the lists in the cells of df$link and create a new data.frame/data.table.
  • Then create an index for which the conditions for inclusion are met and select this subset.
  • At last, convert the values in the link column back to a list for each id.

Or using a dplyr/tidyr combination:

library(dplyr)
library(tidyr)
df %>% 
  unnest(link) %>% 
  group_by(id) %>% 
  filter(!any(id > link) | is.na(link)) %>% 
  summarise(link = toString(link))

which gives a similar result:

# A tibble: 4 × 2
     id  link
  <int> <chr>
1     1  2, 3
2     4    NA
3     5    NA
4     6     7
Comments