Rappster - 1 year ago 74

# Problems with using identical() in dplyr::mutate()

I'd like to use `identical()` inside `mutate()` and I'm getting "strange" results. Am I missing something here or is this a bug?

Consider the following example:

``````
dat <- data.frame(x = 1:4, y = c(1, 2, 10, NA))
``````

I'd like to check if `y` differs from `x`:

``````
mutate(dat, diff = x != y)
#   x  y  diff
# 1 1  1 FALSE
# 2 2  2 FALSE
# 3 3 10  TRUE
# 4 4 NA    NA
``````
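The `NA` in the last row follows from R's missing-value semantics: a comparison that involves `NA` is itself `NA`. A quick base R illustration (the variable names are only for this sketch):

```r
# Comparing against a missing value gives an unknown result
cmp_na <- 4 != NA
# The same holds even when both sides are missing
cmp_both <- NA == NA
# is.na() is the reliable way to detect missing values
is_missing <- is.na(c(4, NA))
```

Both `cmp_na` and `cmp_both` are `NA`, which is why `!=` alone cannot flag a row where only one side is missing.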

This has "problems" with `NA`, so I turned to `identical()`:

``````
mutate(dat, diff = !identical(x, y))
#   x  y diff
# 1 1  1 TRUE
# 2 2  2 TRUE
# 3 3 10 TRUE
# 4 4 NA TRUE
``````

Hm, that's kind of strange. I investigated and found that it has to do with diverging data types:

``````
class(dat$x)
# [1] "integer"
class(dat$y)
# [1] "numeric"
``````
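To see why the type difference matters, here is a minimal base R sketch: `identical()` compares type as well as value, whereas `==` coerces the integer to double before comparing. The variable names are just for illustration.

```r
# identical() is strict about type: integer 1L and double 1 are not identical
strict_cmp <- identical(1L, 1)                 # FALSE
# == coerces the integer to double first, so the values match
loose_cmp <- 1L == 1                           # TRUE
# once the types are aligned, identical() agrees
aligned_cmp <- identical(as.numeric(1L), 1)    # TRUE
```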

So let's take care of aligning that:

``````
dat$x <- as.numeric(dat$x)
dat$y <- as.numeric(dat$y)
``````

Now, I would intuitively think that mutate would give me the same result as this:

``````
sapply(seq_len(nrow(dat)), function(ii) {
  !identical(dat[ii, "x"], dat[ii, "y"])
})
# [1] FALSE FALSE  TRUE  TRUE
``````

But it still gives me this:

``````
mutate(dat, diff = !identical(x, y))
#   x  y diff
# 1 1  1 TRUE
# 2 2  2 TRUE
# 3 3 10 TRUE
# 4 4 NA TRUE
``````

while I'd expect this:

``````
#   x  y  diff
# 1 1  1 FALSE
# 2 2  2 FALSE
# 3 3 10  TRUE
# 4 4 NA  TRUE
``````

What's the reason for this and/or how would I work around this so I could still use `mutate` (which I really like)?
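One likely explanation, offered as a sketch rather than a definitive diagnosis: `identical()` is not vectorized. Inside `mutate()` it receives the whole columns `x` and `y`, returns a single `TRUE` or `FALSE`, and that single value is recycled to every row. Assuming `dplyr` is installed and the columns have already been converted to numeric, wrapping `identical()` in `mapply()` gives a row-by-row comparison (`whole_column` and `res` are illustrative names):

```r
library(dplyr)

dat <- data.frame(x = as.numeric(1:4), y = c(1, 2, 10, NA))

# identical() compares the two columns as whole objects: one value, not four
whole_column <- identical(dat$x, dat$y)
length(whole_column)  # 1

# mapply() calls identical() once per pair of elements,
# so mutate() receives one logical value per row
res <- mutate(dat, diff = !mapply(identical, x, y))
res$diff  # FALSE FALSE TRUE TRUE
```

This matches the `sapply()` result from earlier in the question, but it still calls `identical()` once per row, so it will not be fast on large data.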

## Update

Wow, what a difference in speed!

``````
# Vectorized check: TRUE where x and y differ, treating two NAs as equal
identicalVectorized <- function(x, y) {
  (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y))
}

# Row-wise check: calls identical() once per row via sapply()
# and returns the whole data frame with a `diff` column added
identicalVectorized2 <- function(dat, x, y) {
  dat$diff <- sapply(1:nrow(dat), function(ii) {
    !identical(dat[ii, x], dat[ii, y])
  })
  dat
}

dat <- data.frame(x = c(1:4, NA, NA), y = c(1, 2, 10, NA, 15, NA))

microbenchmark::microbenchmark(
  mutate(dat, diff = identicalVectorized(x, y)),
  mutate(dat, diff = identicalVectorized2(dat, x = "x", y = "y"))
)
``````

Result

``````
Unit: microseconds
                                                            expr     min      lq      mean   median       uq     max neval
                   mutate(dat, diff = identicalVectorized(x, y))  31.965  35.190  40.58286  39.0020  41.6420  83.283   100
 mutate(dat, diff = identicalVectorized2(dat, x = "x", y = "y")) 195.303 211.433 226.42343 215.0985 224.3355 384.743   100
``````
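Speed aside, it may be worth confirming that the vectorized expression gives the same answers as a row-wise `identical()` check. A minimal base R sketch, assuming both columns are numeric; `differs_vectorized` and `differs_rowwise` are illustrative names, not part of the original code:

```r
dat <- data.frame(x = c(1:4, NA, NA), y = c(1, 2, 10, NA, 15, NA))
dat$x <- as.numeric(dat$x)  # align types so identical() compares values only

# Same logical expression as in the benchmark above
differs_vectorized <- function(x, y) {
  (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y))
}

# Element-by-element identical(), negated to mean "differs"
differs_rowwise <- function(x, y) !mapply(identical, x, y)

v <- differs_vectorized(dat$x, dat$y)
r <- differs_rowwise(dat$x, dat$y)
identical(v, r)  # TRUE for this data
```

Without the `as.numeric()` step the two would disagree on the first rows, because `identical()` treats integer `1L` and double `1` as different.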

``````
dat <- data.frame(x = c(1:4, NA), y = c(1, 2, 10, NA, 15))
mutate(dat, diff = (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y)))
``````