Misha V - 1 month ago
R Question

`cor()` gives inconsistent results when given the whole matrix and when given just a pair of columns

I have a matrix with a lot of missing values and I am trying to compute correlations between the columns.

To deal with the missing values, I use

cor(matrix,use="complete")


This gives a matrix with no NA values as desired. However, if I do a pairwise correlation between two of the columns A and B

cor(matrix[,A],matrix[,B],use="complete")


I get a different result than the one in the [A,B] entry in the matrix.

Looking at a plot of the two variables, the second result seems more reasonable.

Where could this discrepancy come from?

Answer

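You are asking about the difference between "complete.obs" and "pairwise.complete.obs". When you pass the whole matrix, use = "complete" (which partially matches "complete.obs") drops every row that has an NA in any column. When you pass only two columns, it drops only the rows where one of those two columns is NA, which is what "pairwise.complete.obs" does for each pair of columns.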

## example matrix
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2,1] <- NA
X[3:4,2] <- NA
X[5:6,3] <- NA

#              [,1]       [,2]        [,3]
# [1,]           NA  0.7635935 -0.22426789
# [2,]           NA -0.7990092  0.37739565
# [3,]  1.329799263         NA  0.13333636
# [4,]  1.272429321         NA  0.80418951
# [5,]  0.414641434 -0.2992151          NA
# [6,] -1.539950042 -0.4115108          NA
# [7,] -0.928567035  0.2522234  1.08576936
# [8,] -0.294720447 -0.8919211 -0.69095384
# [9,] -0.005767173  0.4356833 -1.28459935
#[10,]  2.404653389 -1.2375384  0.04672617

## complete
cor(X, use = "complete.obs")
#            [,1]        [,2]        [,3]
#[1,]  1.00000000 -0.69629279 -0.09773585
#[2,] -0.69629279  1.00000000 -0.01228196
#[3,] -0.09773585 -0.01228196  1.00000000

## pairwise
cor(X, use = "pairwise.complete.obs")
#            [,1]       [,2]        [,3]
#[1,]  1.00000000 -0.5531396  0.08229729
#[2,] -0.55313958  1.0000000 -0.10786401
#[3,]  0.08229729 -0.1078640  1.00000000

For use = "complete.obs", every row with at least one NA is dropped before any correlation is computed. So it essentially does

X1 <- X[7:10, ]  ## only the last 4 rows have no `NA`
cor(X1)
#            [,1]        [,2]        [,3]
#[1,]  1.00000000 -0.69629279 -0.09773585
#[2,] -0.69629279  1.00000000 -0.01228196
#[3,] -0.09773585 -0.01228196  1.00000000
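
The row selection above was done by eye. As a rough sketch of how the same selection could be made for an arbitrary matrix, `complete.cases()` from base R flags the rows that have no NA in any column (the variable name `keep` is just an illustrative choice):

```r
# Rebuild the example matrix from this answer.
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2, 1] <- NA
X[3:4, 2] <- NA
X[5:6, 3] <- NA

# TRUE for rows with no NA in any column.
keep <- complete.cases(X)
which(keep)
# [1]  7  8  9 10

# Subsetting to those rows reproduces cor(X, use = "complete.obs").
all.equal(cor(X[keep, ]), cor(X, use = "complete.obs"))
# [1] TRUE
```

`sum(keep)` then tells you how many observations every entry of the "complete.obs" matrix is based on.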

Here, the (1,2) or (2,1) entry -0.69629279 is computed from only 4 observations. However, if you do it pairwise, columns 1 and 2 are both observed in rows 5 to 10, so it can be computed from 6 observations:

cor(X[5:10, 1], X[5:10, 2])
# [1] -0.5531396
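
To tie this back to the question, here is a small check, using the same example matrix, that the two-vector call with use = "complete.obs" agrees with the pairwise entry rather than with the entry of the full "complete.obs" matrix. The names `full`, `pairwise`, `two_cols` and `n_pair` are illustrative only.

```r
# Rebuild the example matrix from this answer.
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2, 1] <- NA
X[3:4, 2] <- NA
X[5:6, 3] <- NA

full     <- cor(X, use = "complete.obs")            # rows 7:10 for every pair
pairwise <- cor(X, use = "pairwise.complete.obs")   # per-pair complete rows
two_cols <- cor(X[, 1], X[, 2], use = "complete.obs")

# The two-vector call matches the pairwise entry ...
all.equal(two_cols, pairwise[1, 2])
# [1] TRUE

# ... and differs from the entry of the full "complete.obs" matrix.
isTRUE(all.equal(two_cols, full[1, 2]))
# [1] FALSE

# Number of rows where columns 1 and 2 are both observed.
n_pair <- sum(complete.cases(X[, c(1, 2)]))
n_pair
# [1] 6
```

So if you want the matrix entries to agree with the column-by-column results, use = "pairwise.complete.obs" is the option to pass to `cor()` on the whole matrix. Keep in mind that each entry may then be based on a different set of rows.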