Rappster - 3 months ago
R Question

Problems with using identical() in dplyr::mutate()

I'd like to use identical() inside mutate() and I'm getting "strange" results. Am I missing something here or is this a bug?

Consider the following example:

dat <- data.frame(x = 1:4, y = c(1, 2, 10, NA))


I'd like to check if y differs from x:

mutate(dat, diff = x != y)
# x y diff
# 1 1 1 FALSE
# 2 2 2 FALSE
# 3 3 10 TRUE
# 4 4 NA NA


This has "problems" with NA (the comparison returns NA instead of TRUE), so I turned to identical():

mutate(dat, diff = !identical(x, y))
# x y diff
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 3 10 TRUE
# 4 4 NA TRUE


Hm, that's kind of strange. I investigated and suspected it had to do with diverging data types:

class(dat$x)
# [1] "integer"
class(dat$y)
# [1] "numeric"
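
As a small illustration of why the types matter here, the following sketch uses only base R and two scalar values that mirror the first row of the example data (the names x_int and y_dbl are made up for this illustration):

```r
# Base R only; the values mirror the first row of the example data.
x_int <- 1L  # integer, like the elements of 1:4
y_dbl <- 1   # double ("numeric"), like the elements of c(1, 2, 10, NA)

# == coerces the integer to double before comparing the values
x_int == y_dbl

# identical() also compares the storage type, so integer vs. double differs
identical(x_int, y_dbl)

# after converting to double, the two objects are identical
identical(as.numeric(x_int), y_dbl)
```

The first and third calls return TRUE, the second returns FALSE.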


So let's take care of aligning that:

dat$x <- as.numeric(dat$x)
dat$y <- as.numeric(dat$y)


Now, I would intuitively think that mutate would give me the same result as this:

sapply(1:nrow(dat), function(ii) {
!identical(dat[ii, "x"], dat[ii, "y"])
})
# [1] FALSE FALSE TRUE TRUE


But it still gives me this:

mutate(dat, diff = !identical(x, y))
# x y diff
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 3 10 TRUE
# 4 4 NA TRUE


while I'd expect this:

# x y diff
# 1 1 1 FALSE
# 2 2 2 FALSE
# 3 3 10 TRUE
# 4 4 NA TRUE


What's the reason for this and/or how would I work around this so I could still use mutate (which I really like)?




Update



Wow, what a difference in speed!

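# Vectorized: TRUE where x and y differ; two NAs count as equal
identicalVectorized <- function(x, y) {
  (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y))
}

# Row by row: calls identical() once per row, then adds the result as a column
identicalVectorized2 <- function(dat, x, y) {
  dat$diff <- sapply(1:nrow(dat), function(ii) {
    !identical(dat[ii, x], dat[ii, y])
  })
  dat
}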

dat <- data.frame(x = c(1:4,NA, NA), y = c(1, 2, 10, NA, 15, NA))

microbenchmark::microbenchmark(
mutate(dat, diff = identicalVectorized(x, y)),
mutate(dat, diff = identicalVectorized2(dat, x = "x", y = "y"))
)


Result

Unit: microseconds
                                                            expr     min      lq      mean   median       uq     max neval
                   mutate(dat, diff = identicalVectorized(x, y))  31.965  35.190  40.58286  39.0020  41.6420  83.283   100
 mutate(dat, diff = identicalVectorized2(dat, x = "x", y = "y")) 195.303 211.433 226.42343 215.0985 224.3355 384.743   100

Answer
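
Some background on the all-TRUE result, as a sketch rather than a statement about dplyr internals: identical() is not vectorized. It compares its two arguments as whole objects and returns a single TRUE or FALSE, and inside mutate() that single value is recycled to every row. The base R snippet below reproduces this with plain vectors (x and y stand in for the columns after the as.numeric() conversion) and, for comparison, applies identical() pair by pair with mapply():

```r
# Plain vectors standing in for dat$x and dat$y after as.numeric()
x <- as.numeric(1:4)
y <- c(1, 2, 10, NA)

# identical() compares the two whole vectors: the result has length 1
whole <- !identical(x, y)
length(whole)

# Recycling that single value to every row gives the all-TRUE column
rep(whole, length(x))

# Applying identical() to each pair of elements gives the row-wise answer
rowwise <- mapply(function(a, b) !identical(a, b), x, y)
rowwise
```

Here whole is a single TRUE, while rowwise is FALSE, FALSE, TRUE, TRUE, which matches the sapply() result in the question.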

A vectorized comparison that handles NA explicitly might be your best bet:

dat <- data.frame(x = c(1:4,NA), y = c(1, 2, 10, NA, 15))
mutate(dat, diff = x != y | is.na(x) | is.na(y))

If you want two NAs to count as equal (in R, NA == NA evaluates to NA, not TRUE), use this:

mutate(dat, diff = (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y)))
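
If the expression gets reused, it may be easier to read when wrapped in a small helper. The sketch below is one possible way to write it in base R; the name differs_na_aware is hypothetical and not part of dplyr or any other package. It is meant to give the same results as the expression above, with the NA cases spelled out separately:

```r
# Hypothetical helper (not from any package): TRUE where x and y differ,
# treating NA vs. a value as a difference and NA vs. NA as equal.
differs_na_aware <- function(x, y) {
  # exactly one of the two values is missing
  one_na <- xor(is.na(x), is.na(y))
  # both values present and unequal; evaluates to FALSE (not NA) if either is NA
  neq <- !is.na(x) & !is.na(y) & x != y
  one_na | neq
}

dat <- data.frame(x = c(1:4, NA, NA), y = c(1, 2, 10, NA, 15, NA))
dat$diff <- differs_na_aware(dat$x, dat$y)
dat$diff
```

With dplyr loaded, the same helper should also work as mutate(dat, diff = differs_na_aware(x, y)), since it is vectorized over whole columns.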