Misha V - 1 year ago
R Question

`cor()` gives inconsistent results when given the whole matrix and when given just a pair of columns

I have a matrix with a lot of missing values and I am trying to compute correlations between the columns.

To deal with the missing values, I use

cor(matrix,use="complete")


This gives a matrix with no NA values as desired. However, if I do a pairwise correlation between two of the columns A and B

cor(matrix[,A],matrix[,B],use="complete")


I get a different result than the one in the [A,B] entry in the matrix.

Looking at a plot of the two variables, it seems like the second result is more reasonable.

Where could this discrepancy come from?

Answer

You are asking about the difference between "complete.obs" and "pairwise.complete.obs". Note that `use = "complete"` is accepted as an abbreviation of "complete.obs".

## example matrix
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2,1] <- NA
X[3:4,2] <- NA
X[5:6,3] <- NA

#              [,1]       [,2]        [,3]
# [1,]           NA  0.7635935 -0.22426789
# [2,]           NA -0.7990092  0.37739565
# [3,]  1.329799263         NA  0.13333636
# [4,]  1.272429321         NA  0.80418951
# [5,]  0.414641434 -0.2992151          NA
# [6,] -1.539950042 -0.4115108          NA
# [7,] -0.928567035  0.2522234  1.08576936
# [8,] -0.294720447 -0.8919211 -0.69095384
# [9,] -0.005767173  0.4356833 -1.28459935
#[10,]  2.404653389 -1.2375384  0.04672617

## complete
cor(X, use = "complete.obs")
#            [,1]        [,2]        [,3]
#[1,]  1.00000000 -0.69629279 -0.09773585
#[2,] -0.69629279  1.00000000 -0.01228196
#[3,] -0.09773585 -0.01228196  1.00000000

## pairwise
cor(X, use = "pairwise.complete.obs")
#            [,1]       [,2]        [,3]
#[1,]  1.00000000 -0.5531396  0.08229729
#[2,] -0.55313958  1.0000000 -0.10786401
#[3,]  0.08229729 -0.1078640  1.00000000

With use = "complete.obs", any row with at least one NA is dropped before the correlations are computed. So it essentially does

X1 <- X[7:10, ]  ## only the last 4 rows have no `NA`
cor(X1)
#            [,1]        [,2]        [,3]
#[1,]  1.00000000 -0.69629279 -0.09773585
#[2,] -0.69629279  1.00000000 -0.01228196
#[3,] -0.09773585 -0.01228196  1.00000000
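
The row indices do not have to be hard-coded. As a minimal sketch, base R's `complete.cases()` finds the rows without any `NA`; the variable name `keep` is only introduced here for illustration, and the matrix is rebuilt so the snippet runs on its own:

```r
## rebuild the example matrix from above
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2, 1] <- NA
X[3:4, 2] <- NA
X[5:6, 3] <- NA

## TRUE for rows that have no NA in any column
keep <- complete.cases(X)
which(keep)
# [1]  7  8  9 10

## correlating only those rows should agree with use = "complete.obs"
all.equal(cor(X[keep, ]), cor(X, use = "complete.obs"))
# [1] TRUE
```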

Here, the (1,2) or (2,1) entry -0.69629279 is computed from only 4 observations. With pairwise deletion, it can be computed from 6 observations, because rows 5 to 10 have no NA in columns 1 and 2:

cor(X[5:10, 1], X[5:10, 2])
# [1] -0.5531396
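
This is also what happens in the question. When `cor()` is given two vectors, "complete.obs" can only drop rows where one of those two vectors is `NA`, so the result should match the pairwise entry of the full matrix rather than the complete-case one. A small check on the same example, where `r_pair` and `r_mat` are names made up for this sketch:

```r
## rebuild the example matrix from above
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2, 1] <- NA
X[3:4, 2] <- NA
X[5:6, 3] <- NA

## two-vector call: only NAs in columns 1 and 2 can remove a row
r_pair <- cor(X[, 1], X[, 2], use = "complete.obs")

## matrix call with pairwise deletion
r_mat <- cor(X, use = "pairwise.complete.obs")

all.equal(r_pair, r_mat[1, 2])
# [1] TRUE

## rows used for the (1,2) entry under each option
sum(complete.cases(X[, 1:2]))  ## pairwise: rows complete in columns 1 and 2
# [1] 6
sum(complete.cases(X))         ## complete: rows complete in all columns
# [1] 4
```

So if the matrix call should agree with the column-by-column calls, `use = "pairwise.complete.obs"` is the option to pass.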