daRknight daRknight - 1 month ago 6
R Question

Selecting non `NA` values from duplicate rows with `data.table`

I have a

data.table
and need to do some cleaning for multiple values that should be unique -- a simplified example should help:

> DT
id type
1: 1 A
2: 2 X
3: 3 X
4: 3 G
5: 3 NA
6: 4 D
7: 5 NA


The issue I'm having is multiple values for "type" for the same ID, or:

> DT[id == 3]
id type
1: 3 X
2: 3 G
3: 3 NA


The source of the multiple values is irrelevant, but I'd like to clean this in a specific manner: take the last reported value unless it's an
NA
. So the cleaning should result in a single row per ID, and in the example case would look like:

> DTclean
id type
1: 1 A
2: 2 X
3: 3 G
4: 4 D
5: 5 NA


My current approach is to sort
DT
by
type
, so that all the
NA
's are first, and then use
duplicated
-- this is acceptable, but I feel there is a better method, plus though not crucial, this doesn't always take the last reported value -- in the above case it takes X instead of G.

This is my current approach:

> setorder(DT, type)
> DTclean <- DT[!duplicated(id, fromLast = T)]
> DTclean
id type
1: 1 A
2: 2 X
3: 3 X
4: 4 D
5: 5 NA


Any ideas/help would be much appreciated!

Answer

A slight variation on your idea should work. The concept is that you want a separate column indicating only whether or not "type" is NA, not the value, and sort by that column and then the original ID:

> DT$typena<-is.na(DT$type)
> setorderv(DT,c('typena','id'),order=c(-1,1))
> DT[!duplicated(id,fromLast=T)]
   id type typena
1:  5   NA   TRUE
2:  1    A  FALSE
3:  2    X  FALSE
4:  3    G  FALSE
5:  4    D  FALSE

You can then remove the extra column again if you don't want it in the way by assigning NULL to it, or do it all in one step:

DT[!duplicated(id,fromLast=T),c('id','type'),with=F]

Here is a dput of the sample data above:

DT <- structure(list(id = c(1L, 2L, 3L, 3L, 3L, 4L, 5L), type = c("A", 
      "X", "X", "G", NA, "D", NA)), .Names = c("id", "type"), 
      row.names =  c(NA, -7L), class = c("data.table", "data.frame"))