Rappster - 5 months ago 26

R Question

I'd like to use `identical()` inside `mutate()`.

Consider the following example:

`dat <- data.frame(x = 1:4, y = c(1, 2, 10, NA))`

I'd like to check whether `y` differs from `x`:

`mutate(dat, diff = x != y)`

# x y diff

# 1 1 1 FALSE

# 2 2 2 FALSE

# 3 3 10 TRUE

# 4 4 NA NA

This has "problems" with `NA`, so I turned to `identical()`:

`mutate(dat, diff = !identical(x, y))`

# x y diff

# 1 1 1 TRUE

# 2 2 2 TRUE

# 3 3 10 TRUE

# 4 4 NA TRUE

Hm, that's kind of strange. I investigated and found that it had to do with diverging data types:

`class(dat$x)`

# [1] "integer"

`class(dat$y)`

# [1] "numeric"

So let's take care of aligning that:

`dat$x <- as.numeric(dat$x)`

`dat$y <- as.numeric(dat$y)`

Now, I would intuitively think that mutate would give me the same result as this:

`sapply(1:nrow(dat), function(ii) {`

!identical(dat[ii, "x"], dat[ii, "y"])

})

# [1] FALSE FALSE TRUE TRUE

But it still gives me this:

`mutate(dat, diff = !identical(x, y))`

# x y diff

# 1 1 1 TRUE

# 2 2 2 TRUE

# 3 3 10 TRUE

# 4 4 NA TRUE

while I'd expect this:

# x y diff

# 1 1 1 FALSE

# 2 2 2 FALSE

# 3 3 10 TRUE

# 4 4 NA TRUE

What's the reason for this, and how can I work around it so that I can still use `mutate()`?

Wow, what a difference in speed between the vectorized expression and the row-wise `sapply()` approach!

`identicalVectorized <- function(x, y) {`

(x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y))

}

identicalVectorized2 <- function(dat, x, y) {

dat$diff <- sapply(1:nrow(dat), function(ii) {

!identical(dat[ii, x], dat[ii, y])

})

dat

}

dat <- data.frame(x = c(1:4,NA, NA), y = c(1, 2, 10, NA, 15, NA))

microbenchmark::microbenchmark(

mutate(dat, diff = identicalVectorized(x, y)),

mutate(dat, diff = identicalVectorized2(dat, x = "x", y = "y"))

)

Result

Unit: microseconds

| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| `mutate(dat, diff = identicalVectorized(x, y))` | 31.965 | 35.190 | 40.58286 | 39.0020 | 41.6420 | 83.283 | 100 |
| `mutate(dat, diff = identicalVectorized2(dat, x = "x", y = "y"))` | 195.303 | 211.433 | 226.42343 | 215.0985 | 224.3355 | 384.743 | 100 |

Answer

This might be your best bet:

```
dat <- data.frame(x = c(1:4,NA), y = c(1, 2, 10, NA, 15))
mutate(dat, diff = x != y | is.na(x) | is.na(y))
```
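As for why `!identical(x, y)` gave `TRUE` in every row: `identical()` compares its two arguments as whole objects and returns a single `TRUE` or `FALSE`, and `mutate()` then recycles that one value to every row. A minimal base-R sketch of the difference, using the question's data after the `as.numeric()` conversion (the names `whole` and `rowwise` are only for illustration):

```r
# The question's data, with both columns already numeric
dat <- data.frame(x = c(1, 2, 3, 4), y = c(1, 2, 10, NA))

# identical() compares the two whole columns and returns one value
whole <- identical(dat$x, dat$y)
print(whole)     # FALSE, so !whole is TRUE and gets recycled to all rows

# A per-row result needs one identical() call per pair of values
rowwise <- !mapply(identical, dat$x, dat$y)
print(rowwise)   # FALSE FALSE TRUE TRUE
```

The `mapply()` version matches the `sapply()` result from the question, but it still calls `identical()` once per row, which is why a fully vectorized expression tends to be faster.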

If you want two `NA`s to count as equal (in R, `NA == NA` evaluates to `NA`, not `TRUE`), use this:

```
mutate(dat, diff = (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y)))
```
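To see how the two expressions differ, here is a small base-R sketch that wraps each one in a helper and runs it on the six-row data from the question's benchmark. The helper names `differs_strict` and `differs_na_equal` are made up for illustration; inside `mutate()` you would call them as, for example, `diff = differs_na_equal(x, y)`.

```r
# Hypothetical helpers wrapping the two expressions from the answer
differs_strict <- function(x, y) {
  # an NA on either side counts as a difference
  x != y | is.na(x) | is.na(y)
}

differs_na_equal <- function(x, y) {
  # same as above, except that two NAs are treated as equal
  (x != y | is.na(x) | is.na(y)) & !(is.na(x) & is.na(y))
}

x <- c(1:4, NA, NA)
y <- c(1, 2, 10, NA, 15, NA)

print(NA == NA)                 # NA
print(differs_strict(x, y))     # FALSE FALSE TRUE TRUE TRUE TRUE
print(differs_na_equal(x, y))   # FALSE FALSE TRUE TRUE TRUE FALSE
```

The two helpers only disagree in the last position, where both values are `NA`.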