revy revy - 2 months ago 16
Python Question

scipy.sparse matrix: subtract row mean from nonzero elements

I have a sparse matrix in csr_matrix format. For each row I need to subtract the row mean from the nonzero elements. The mean must be computed over the number of nonzero elements in the row (not the length of the row).
I found a fast way to compute the row means with the following code:

# M is a csr_matrix
sums = np.squeeze(np.asarray(M.sum(1))) # sum of the nonzero elements, for each row
counts = np.diff(M.tocsr().indptr) # count of the nonzero elements, for each row


# for the i-th row the mean is just sums[i] / float(counts[i])
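One caveat with the snippet above: a row with no stored elements has a count of zero, so the division fails or yields nan for that row. A minimal sketch of a guarded version follows. The helper name `nonzero_row_means` and the choice to report 0.0 for empty rows are assumptions for illustration, not part of the original code. Note also that `indptr` counts stored entries, so any explicitly stored zeros are included in the count.

```python
import numpy as np
import scipy.sparse as sp

def nonzero_row_means(M):
    """Mean of the stored entries of each row (hypothetical helper).

    Rows with no stored entries get a mean of 0.0 (an assumption).
    """
    M = sp.csr_matrix(M)
    sums = np.asarray(M.sum(axis=1)).ravel()   # per-row sum as a flat array
    counts = np.diff(M.indptr)                 # stored entries per row
    means = np.zeros(M.shape[0], dtype=float)
    nonempty = counts > 0
    means[nonempty] = sums[nonempty] / counts[nonempty]
    return means

# Small example with an empty middle row
M = sp.csr_matrix([[1., 0., 2.], [0., 0., 0.], [1., 2., 3.]])
means = nonzero_row_means(M)
```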


The problem is the update step. I need a fast way to do it.
Currently I convert M to a lil_matrix and perform the updates like this:

M = M.tolil()

for i in range(M.shape[0]):  # loop over the rows of M
    for j in M.getrow(i).nonzero()[1]:  # column indices of the nonzero entries in row i
        M[i, j] -= sums[i] / float(counts[i])


which is slow. Any suggestion for a faster solution?

Answer

This one is tricky, but I think I have it. The basic idea is to build a diagonal matrix with the row means on the diagonal, and a matrix with the same sparsity pattern as M but with ones at M's nonzero locations. Multiplying the two gives a matrix that holds each row's mean at exactly the nonzero positions of that row, and we subtract that product from M. Here goes...

>>> import numpy as np
>>> import scipy.sparse as sp
>>> a = sp.csr_matrix([[1., 0., 2.], [1.,2.,3.]])
>>> a.todense()
matrix([[ 1.,  0.,  2.],
        [ 1.,  2.,  3.]])
>>> tot = np.array(a.sum(axis=1).squeeze())[0]
>>> tot
array([ 3.,  6.])
>>> cts = np.diff(a.indptr)
>>> cts
array([2, 3], dtype=int32)
>>> mu = tot/cts
>>> mu
array([ 1.5,  2. ])
>>> d = sp.diags(mu, 0)
>>> d.todense()
matrix([[ 1.5,  0. ],
        [ 0. ,  2. ]])
>>> b = a.copy()
>>> b.data = np.ones_like(b.data)
>>> b.todense()
matrix([[ 1.,  0.,  1.],
        [ 1.,  1.,  1.]])
>>> (d * b).todense()
matrix([[ 1.5,  0. ,  1.5],
        [ 2. ,  2. ,  2. ]])
>>> (a - d*b).todense()
matrix([[-0.5,  0. ,  0.5],
        [-1. ,  0. ,  1. ]])
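
As a possible alternative to the diagonal-matrix product (a different technique from the session above), the means can be subtracted directly from the CSR `data` array. This works because CSR stores each row's entries contiguously, so repeating each mean by its row count lines up with `data`. This is only a sketch: the function name `subtract_row_means` is hypothetical, and treating empty rows as having mean 0 is an assumption.

```python
import numpy as np
import scipy.sparse as sp

def subtract_row_means(M):
    """Return a copy of M with each row's mean (over stored entries)
    subtracted from that row's stored entries. Hypothetical helper."""
    M = sp.csr_matrix(M, dtype=float, copy=True)   # work on a float copy
    counts = np.diff(M.indptr)                     # stored entries per row
    sums = np.asarray(M.sum(axis=1)).ravel()       # per-row sums
    # Guard against empty rows: leave their mean at 0 (an assumption)
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Row i owns data[indptr[i]:indptr[i+1]], so repeat each mean counts[i] times
    M.data -= np.repeat(means, counts)
    return M

a = sp.csr_matrix([[1., 0., 2.], [1., 2., 3.]])
result = subtract_row_means(a)
```

Entries that happen to equal their row mean (the 2 in the second row here) remain stored as explicit zeros; calling `result.eliminate_zeros()` would drop them if that matters.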

Good Luck! Hope that helps.