SoilSciGuy SoilSciGuy - 3 months ago 13
R Question

Determine missing observations (not NAs) in dataset in R

I have a fairly large data set with nearly 6000 observations in its entirety. I was provided another data set that has missing observations. Instead of filling the rows/observations with NA if measurement did not occur, the entire row/observation was omitted. This second data set has 5500 rows.

I need to determine which observations were not recorded, or in other words, which rows are missing in the second data set. I'm not referring to NAs or missing values, but observations that did not make it into the data set.

In my example below, every observation (ID) should also have "group' record of 11,12,13,14,21,22,23,24. However, ID 206902 only has groups 11, 12, 14, 21, 22, 23, 24. It's missing 13

In this example, the ID's aren't truly unique, so there should be 8 ID's.
For example, ID 206901 & group 11; ID 206901 & group 12, etc.

How can I easily determine which observations (ID) are missing? Again, there should be 8 records for each ID.

example <- structure(list(ID = c(206901L, 206901L, 206901L, 206901L, 206901L,
206901L, 206901L, 206901L, 206902L, 206902L, 206902L, 206902L,
206902L, 206902L, 206902L), group = c(11L, 12L, 13L, 14L, 21L,
22L, 23L, 24L, 11L, 12L, 14L, 21L, 22L, 23L, 24L)), .Names = c("ID",
"group"), sorted = "ID", class = c("tbl_dt", "tbl", "data.table",
"data.frame"), row.names = c(NA, -15L), .internal.selfref = <pointer: 0x0000000000100788>)


It is not exactly clear what the expected output would be, but if we need to list all the missing "group" in a particular ID among all the unique "group", then we can use %in% and ! to get the logical index of 'group' that is missing grouped by 'ID', and return the missing "group" by using the logical index "i1" if there are any missing group or else return 'NA'.

Un <- unique(example[["group"]])
example[, {i1 <- !Un %in% group
           list(NotFound = if(any(i1)) Un[i1] else NA_integer_) 
          }, ID]
#       ID NotFound
#1: 206901       NA
#2: 206902       13

Or as @Frank mentioned, we can cross-join (CJ) the unique elements in 'ID' and 'group' column, and join the output with 'example' on the 'ID', 'group' columns, specifying the 'i' as negate (!) of 'example' to output only elements that are not found in 'example'

example[, CJ(ID=ID, group=group, unique=TRUE)
                     ][!example, on = c("ID", "group")]