Bobe Kryant - 1 year ago 151
Python Question

# numpy covariance and covariance matrix by formula is producing different results

I generate a matrix that I want to get the covariance of:

``````test=np.array([4,2,.6,4.2,2.1,.59,3.9,2,.58,4.3,2.1,.62,4.1,2.2,.63]).reshape(5,3)
test
array([[ 4.  ,  2.  ,  0.6 ],
[ 4.2 ,  2.1 ,  0.59],
[ 3.9 ,  2.  ,  0.58],
[ 4.3 ,  2.1 ,  0.62],
[ 4.1 ,  2.2 ,  0.63]])
``````

I calculate the covariance with the numpy function:

``````np.cov(test)
array([[ 2.92      ,  3.098     ,  2.846     ,  3.164     ,  2.966     ],
[ 3.098     ,  3.28703333,  3.0199    ,  3.3566    ,  3.1479    ],
[ 2.846     ,  3.0199    ,  2.7748    ,  3.0832    ,  2.8933    ],
[ 3.164     ,  3.3566    ,  3.0832    ,  3.4288    ,  3.2122    ],
[ 2.966     ,  3.1479    ,  2.8933    ,  3.2122    ,  3.0193    ]])
``````

This however is different than following the covariance formula:

``````mean=np.mean(test,0)
np.dot(test-mean,(test-mean).T)/(5-1)
array([[ 0.004104, -0.002886,  0.006624, -0.005416, -0.002426],
[-0.002886,  0.002649, -0.005316,  0.005044,  0.000509],
[ 0.006624, -0.005316,  0.011744, -0.010496, -0.002556],
[-0.005416,  0.005044, -0.010496,  0.010164,  0.000704],
[-0.002426,  0.000509, -0.002556,  0.000704,  0.003769]])
``````

This does not match the numpy calculations.
In fact, I take a peek at the source code and the equation is
`(x-m) * (x-m).T.conj() / (N - 1)`
which I believe I am implementing.

The difference comes from the fact that the `np.cov` calculates the covariance between row vectors, which is why the result is `5*5` instead of `3*3`, but `np.mean` calculates the average of column vectors and when you do `test - mean` the calculation is also broadcasted along column which differs from what `np.cov` is doing, the fix would be a two-step:

Firstly, make sure the mean is calculated for each row, which can be done by simply transposing the `test` matrix:

``````mean = np.mean(test.T, 0)
``````

And then when calculate `x - x_bar`, reshape the mean vector so that the minus is along the rows as well, and also since the vector under test is row vector the dimension is going to be `3` instead of `5`. After these fixing, it will give consistent results as `np.cov` does:

``````np.dot(test-mean[:, None],(test-mean[:, None]).T)/(3-1)

# array([[ 2.92      ,  3.098     ,  2.846     ,  3.164     ,  2.966     ],
#        [ 3.098     ,  3.28703333,  3.0199    ,  3.3566    ,  3.1479    ],
#        [ 2.846     ,  3.0199    ,  2.7748    ,  3.0832    ,  2.8933    ],
#        [ 3.164     ,  3.3566    ,  3.0832    ,  3.4288    ,  3.2122    ],
#        [ 2.966     ,  3.1479    ,  2.8933    ,  3.2122    ,  3.0193    ]])
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download