user1717931 - 12 days ago
Python Question

Stacking two sparse matrices with different dimensions

I have two sparse matrices (created with sklearn's HashingVectorizer from two sets of features, where each set corresponds to one feature). I want to concatenate them so that I can later use them for clustering, but I am running into a problem with dimensions: the two matrices do not have the same number of rows.

Here is an example:

Xa = [-0.57735027 -0.57735027 0.57735027 -0.57735027 -0.57735027 0.57735027
0.5 0.5 -0.5 0.5 0.5 -0.5 0.5
0.5 -0.5 0.5 -0.5 0.5 0.5 -0.5
0.5 0.5 ]

Xb = [-0.57735027 -0.57735027 0.57735027 -0.57735027 0.57735027 0.57735027
-0.5 0.5 0.5 0.5 -0.5 -0.5 0.5
-0.5 -0.5 -0.5 0.5 0.5 ]


Both Xa and Xb are of type <class 'scipy.sparse.csr.csr_matrix'>. Their shapes are Xa.shape = (6, 1048576) and Xb.shape = (5, 1048576). The error I get (I now understand why it happens) is:

X = hstack((Xa, Xb))
File "/usr/local/lib/python2.7/site-packages/scipy/sparse/construct.py", line 464, in hstack
return bmat([blocks], format=format, dtype=dtype)
File "/usr/local/lib/python2.7/site-packages/scipy/sparse/construct.py", line 581, in bmat
'row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions


Is there a way to stack the sparse-matrices despite their irregular dimensions? Maybe with some padding?

I have looked into these posts:


Answer

You can pad it with an empty sparse matrix.

You want to stack the matrices horizontally, so you need to pad the smaller matrix until it has the same number of rows as the larger one. To do that, vertically stack it with an empty matrix of shape (difference in number of rows, number of columns of the original matrix).

Like this:

from scipy.sparse import csr_matrix, hstack, vstack

# Create two empty sparse matrices for the demo
Xa = csr_matrix((4, 4))
Xb = csr_matrix((3, 5))

# Difference in the number of rows between Xa and Xb
# (this demo assumes Xa has at least as many rows as Xb)
diff_n_rows = Xa.shape[0] - Xb.shape[0]

# Pad Xb at the bottom with an empty (all-zero) sparse block
Xb_new = vstack((Xb, csr_matrix((diff_n_rows, Xb.shape[1]))))

# Both blocks now have 4 rows, so they can be stacked side by side
X = hstack((Xa, Xb_new))
X

Which results in:

<4x9 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in COOrdinate format>
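
The demo above assumes Xa is the taller matrix and returns the result in COO format. A minimal sketch of a more general version follows; pad_rows and hstack_padded are hypothetical helper names, not part of scipy. It pads whichever matrix is shorter and asks hstack for CSR output through its format argument. Keep in mind that padding pairs row i of one matrix with row i of the other, so it only makes sense if the rows describe the same samples in the same order.

```python
from scipy.sparse import csr_matrix, hstack, vstack


def pad_rows(m, n_rows):
    """Return m padded at the bottom with all-zero rows up to n_rows rows."""
    missing = n_rows - m.shape[0]
    if missing <= 0:
        return m  # already tall enough, nothing to pad
    padding = csr_matrix((missing, m.shape[1]), dtype=m.dtype)
    return vstack((m, padding), format="csr")


def hstack_padded(a, b):
    """Horizontally stack a and b after padding the shorter one (hypothetical helper)."""
    n_rows = max(a.shape[0], b.shape[0])
    return hstack((pad_rows(a, n_rows), pad_rows(b, n_rows)), format="csr")


# Small non-empty matrices so the effect of the padding is visible
Xa = csr_matrix([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])  # 3 rows, 2 columns
Xb = csr_matrix([[0.0, 4.0, 0.0], [5.0, 0.0, 0.0]])    # 2 rows, 3 columns

X = hstack_padded(Xa, Xb)
print(X.shape)      # (3, 5)
print(X.toarray())  # the last row of the Xb block is all zeros
```

The argument order does not matter for the padding: hstack_padded(Xb, Xa) also works, with the Xb columns placed first.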