Python Question

numpy multivariate_normal bug when dimension too high

I am working on a homework assignment and I noticed that when the dimension of the mean and covariance is very high, multivariate_normal will occupy all CPU forever without generating any results.

An example code snippet:

import numpy as np
from numpy.random import multivariate_normal

p = 5000  # at this dimension the call below never returns
cov_true = np.eye(p)
mean_true = np.zeros(p)
beta_true = multivariate_normal(mean_true, cov_true, size=1).T


With p=5000, this will run forever.
Environment: Python 3.4 and Python 3.5, NumPy 1.11.0.

Is it really a bug or did I miss something?

Answer

What takes so much time?

To account for relations between variables, NumPy computes the singular value decomposition of your covariance matrix, and this takes the majority of the time (the underlying GESDD routine is in general Θ(n³), and 5000³ is already quite a lot).
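A rough way to see this for yourself is the sketch below; p=2000 is an assumption chosen so the comparison finishes in reasonable time, and np.linalg.svd stands in for the decomposition multivariate_normal performs internally:

import time
import numpy as np

p = 2000
cov = np.eye(p)
mean = np.zeros(p)

t0 = time.time()
np.linalg.svd(cov)  # roughly the decomposition multivariate_normal does internally
print("svd took", time.time() - t0, "s")

t0 = time.time()
np.random.multivariate_normal(mean, cov, size=1)
print("multivariate_normal took", time.time() - t0, "s")

Both calls should take a similar amount of time, which suggests the decomposition, not the sampling itself, is the bottleneck.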

How can things be sped up?

In the simplest case with all variables independent, you could just use random.normal:

from numpy.random import normal

# means and deviations: 1-D arrays of per-variable means and standard deviations
sample = normal(means, deviations, len(means))
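For the identity covariance in the question, a minimal sketch of this shortcut (reusing p, mean_true, and cov_true from there) could look like:

import numpy as np
from numpy.random import normal

p = 5000
mean_true = np.zeros(p)
cov_true = np.eye(p)

# for a diagonal covariance, the standard deviations are the square roots of the diagonal
deviations = np.sqrt(np.diag(cov_true))
beta_true = normal(mean_true, deviations, p).reshape(-1, 1)  # column vector, like the .T in the question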

Otherwise, if your covariance matrix happens to be full rank (hence positive-definite), replace the SVD with a Cholesky decomposition (still Θ(n³) in general, but with a smaller constant):

from numpy.random import standard_normal
from scipy.linalg import cholesky

# lower-triangular factor, so that covariances == l @ l.T
l = cholesky(covariances, lower=True, check_finite=False, overwrite_a=True)
sample = means + l.dot(standard_normal(len(means)))
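A self-contained sketch under assumed names (the 3×3 covariance below is made up for illustration, not from the question); note that the same factor can be reused for many draws at once:

import numpy as np
from numpy.random import standard_normal
from scipy.linalg import cholesky

means = np.array([0.0, 1.0, -1.0])
covariances = np.array([[2.0, 0.5, 0.0],
                        [0.5, 1.0, 0.3],
                        [0.0, 0.3, 1.5]])

l = cholesky(covariances, lower=True)  # covariances == l @ l.T
sample = means + l.dot(standard_normal(len(means)))

# many draws at once: each column of `samples` is one sample
samples = means[:, None] + l.dot(standard_normal((len(means), 100)))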

If the matrix may be singular (as is sometimes the case), then either wrap LAPACK's SPSTRF (pivoted Cholesky) yourself or consider helping with scipy#6202.
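As a stopgap, a different technique from the pivoted Cholesky just mentioned is a symmetric eigendecomposition, which tolerates semi-definite covariances; a hedged sketch on a toy rank-deficient matrix:

import numpy as np

# toy rank-deficient covariance: the third variable is the sum of the first two
covariances = np.array([[1.0, 0.0, 1.0],
                        [0.0, 1.0, 1.0],
                        [1.0, 1.0, 2.0]])
means = np.zeros(3)

w, v = np.linalg.eigh(covariances)           # covariances == v @ diag(w) @ v.T
factor = v * np.sqrt(np.clip(w, 0.0, None))  # clip round-off negatives; factor @ factor.T ≈ covariances
sample = means + factor.dot(np.random.standard_normal(len(means)))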

Cholesky will likely be noticeably faster, but if that is still not sufficient, you could investigate whether the matrix can be decomposed analytically, or try a different underlying library (such as ACML, MKL, or cuSOLVER).
