Lukasz - 1 month ago
Python Question

Extracting Items From Sparse Matrix

I'm working with a series of text corpora, and in doing so I need to construct a co-occurrence matrix. I'm currently writing and testing my code, so every time I run it I get a different matrix (since list(set()) is unordered). I've constructed a sparse matrix using scipy.sparse.coo_matrix() and would like to use the coordinates and values generated by that construction. I imagine this would be the fastest and most memory-efficient way of doing it. At the moment, when I try to access those values I am presented with:

[<1x16 sparse matrix of type '<class 'numpy.float32'>'
with 10 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
with 4 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
with 4 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
with 7 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'


When I print the sparse matrix I get the following:

(0, 1) 0.5
(0, 4) 1.0
(0, 6) 0.5
(1, 7) 1.0
(1, 11) 1.0
(1, 12) 1.0
(1, 13) 0.5
(2, 14) 0.5
...
(15, 6) 1.0
(15, 9) 0.5
(15, 15) 3.0
(15, 0) 2.0
(15, 1) 0.5
(15, 6) 0.5
(15, 14) 1.5


I would imagine that retrieving those values as they appear is possible.

For the above example I extract the following instance:

import numpy as np
from scipy import sparse

row = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4,
4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8,
9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13,
13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15]

column = [1, 4, 6, 7, 11, 12, 13, 14, 15, 0, 4, 9, 12, 13, 14, 15, 4, 5, 12, 13,
4, 9, 13, 14, 0, 1, 2, 3, 5, 8, 10, 12, 13, 14, 2, 4, 12, 13, 0, 14,
15, 0, 8, 11, 13, 4, 7, 10, 11, 1, 3, 12, 14, 4, 8, 11, 13, 0, 7, 8,
10, 0, 1, 2, 4, 5, 9, 13, 0, 1, 2, 3, 4, 5, 7, 10, 12, 0, 1, 3, 4, 6,
9, 15, 0, 1, 6, 14]

values = [0.5, 1.0, 0.5, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 0.5,
1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 0.5, 1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5,
0.5, 1.0, 0.5, 0.5, 1.0, 1.0, 1.5, 2.0, 1.0, 2.5, 1.0, 3.0, 1.0, 0.5,
1.5, 2.0, 1.0, 1.0, 2.0, 0.5, 1.0, 0.5, 2.0, 2.0, 0.5, 4.0, 0.5, 0.5,
0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 0.5, 1.0, 1.0, 0.5, 0.5, 0.5, 2.5, 1.0,
4.0, 1.0, 1.0, 1.5, 1.0, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 1.0, 0.5, 3.0,
2.0, 0.5, 0.5, 1.5]

sps_array = sparse.coo_matrix((values, (row, column)), shape=(16, 16))


At the moment I transform sps_array using sps_array.toarray() and then create two lists:

list1 = list(np.nonzero(sps_array > 0)[0])
list2 = list(np.nonzero(sps_array > 0)[1])


and then use the following for loop to reconstruct the coordinates:

index = 0
sps_coordinates = []

# token_size and list1_count come from elsewhere in my code (not shown here)
for i in range(token_size):
    for j in range(list1_count[i]):
        sps_coordinates.append((list1[index + j], list2[index + j]))
    index += list1_count[i]


I retrieve the values with

list(sps_array[sps_array > 0])


Is there a more efficient way to get those coordinates and values relative to what I have done?

Answer

With a copy-n-paste I construct your sps_array:

In [2126]: sps_array
Out[2126]: 
<16x16 sparse matrix of type '<class 'numpy.float64'>'
    with 88 stored elements in COOrdinate format>

A coo format stores its values in 3 attributes, each an array (derived from the 3 input lists):

In [2127]: sps_array.data
Out[2127]: 
array([ 0.5,  1. ,  0.5,  1. ,  1. ,  1. ,  0.5,  0.5,  1. ,  1. ,  0.5,
        0.5,  1. ,  0.5,  1. ,  0.5,  1. ,  0.5,  1. ,  0.5,  0.5,  1. ,
        0.5,  1. ,  1. ,  1. ,  1. ,  0.5,  0.5,  1. ,  0.5,  0.5,  1. ,
        1. ,  1.5,  2. ,  1. ,  2.5,  1. ,  3. ,  1. ,  0.5,  1.5,  2. ,
        1. ,  1. ,  2. ,  0.5,  1. ,  0.5,  2. ,  2. ,  0.5,  4. ,  0.5,
        0.5,  0.5,  1. ,  1. ,  0.5,  0.5,  1. ,  0.5,  1. ,  1. ,  0.5,
        0.5,  0.5,  2.5,  1. ,  4. ,  1. ,  1. ,  1.5,  1. ,  1. ,  1. ,
        0.5,  1. ,  0.5,  1. ,  1. ,  0.5,  3. ,  2. ,  0.5,  0.5,  1.5])
In [2128]: sps_array.row
Out[2128]: 
array([ 0,  0,  0,  1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  5,  5,
        6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8,  9,  9,  9,  9, 10,
       10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13,
       13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15,
       15, 15, 15], dtype=int32)
In [2129]: sps_array.col
Out[2129]: 
array([ 1,  4,  6,  7, 11, 12, 13, 14, 15,  0,  4,  9, 12, 13, 14, 15,  4,
        5, 12, 13,  4,  9, 13, 14,  0,  1,  2,  3,  5,  8, 10, 12, 13, 14,
        2,  4, 12, 13,  0, 14, 15,  0,  8, 11, 13,  4,  7, 10, 11,  1,  3,
       12, 14,  4,  8, 11, 13,  0,  7,  8, 10,  0,  1,  2,  4,  5,  9, 13,
        0,  1,  2,  3,  4,  5,  7, 10, 12,  0,  1,  3,  4,  6,  9, 15,  0,
        1,  6, 14], dtype=int32)
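
As a minimal sketch of how those three attributes give you coordinates and values directly, here is a small stand-in matrix (the name small and its numbers are made up for illustration, not taken from your data):

```python
from scipy import sparse

# Small illustrative matrix; the values are invented for this sketch.
row = [0, 0, 1, 2]
col = [1, 3, 2, 0]
values = [0.5, 1.0, 1.0, 2.0]
small = sparse.coo_matrix((values, (row, col)), shape=(3, 4))

# Pair up the stored row and column indices, in storage order.
coordinates = list(zip(small.row.tolist(), small.col.tolist()))

# The values line up one-to-one with the coordinates.
stored_values = small.data.tolist()
```

The same three lines should apply unchanged to sps_array, since it is already in coo format.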

A sparse matrix has a nonzero method, whose code is:

    A = self.tocoo()
    nz_mask = A.data != 0
    return (A.row[nz_mask],A.col[nz_mask])

It makes sure the matrix is in coo format, makes sure there aren't any 'hidden' zeros in the data, and returns the row and col attributes.

This isn't needed if your matrix is already coo, but is needed if the matrix is in csr format.
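
To illustrate the csr case and the 'hidden' zero point, here is a rough sketch with a tiny matrix that has one explicitly stored zero (the matrix and its values are invented for the example):

```python
from scipy import sparse

# Illustrative matrix with an explicitly stored zero at (1, 2).
row = [0, 1, 2]
col = [1, 2, 0]
values = [0.5, 0.0, 2.0]
csr = sparse.coo_matrix((values, (row, col)), shape=(3, 3)).tocsr()

# A csr matrix has no .row/.col attributes, so convert it first.
coo = csr.tocoo()
stored = list(zip(coo.row.tolist(), coo.col.tolist()))  # includes the stored zero

# nonzero() does the conversion itself and masks out the stored zero.
rows, cols = csr.nonzero()
nonzero_coords = list(zip(rows.tolist(), cols.tolist()))
```

Here stored has three entries while nonzero_coords has two, which is the difference the mask makes.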

So you don't need to go through the dense toarray and np.nonzero route. However, np.nonzero(sps_array) does work, because it delegates the task to sps_array.nonzero().

Applying transpose to the nonzero gives an array that may be what you want:

In [2136]: np.transpose(np.nonzero(sps_array))
Out[2136]: 
array([[ 0,  1],
       [ 0,  4],
       [ 0,  6],
       [ 1,  7],
       [ 1, 11],
       [ 1, 12],
       ....

In fact, there is a NumPy function that does just this for any array (look at its code or docs):

np.argwhere(sps_array)

(You don't need nonzero(sps_array > 0); plain nonzero is enough unless the matrix can hold negative values that you want to exclude.)
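
As a quick check of the coordinate-pair idea, this sketch builds the (n, 2) array from the sparse matrix's own nonzero method and compares it with the dense route. The small matrix is invented for the example, and its entries are listed in row-major order so that both routes return the pairs in the same order:

```python
import numpy as np
from scipy import sparse

# Illustrative matrix; entries are given in row-major order.
small = sparse.coo_matrix(
    ([0.5, 1.0, 2.0], ([0, 0, 2], [1, 3, 0])), shape=(3, 4)
)

# (n, 2) array of [row, col] pairs, built without densifying the matrix.
pairs = np.transpose(small.nonzero())

# The same pairs via the dense array, shown only for comparison.
dense_pairs = np.argwhere(small.toarray())
```

For a coo matrix whose entries are not stored in row-major order, the sparse route keeps the storage order while the dense route sorts by row and then column, so the two arrays may list the same pairs in a different order.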