Lukasz - 1 year ago 141
Python Question

# Extracting Items From Sparse Matrix

I'm working with a series of text corpora, and in doing so I need to construct a co-occurrence matrix. I'm currently writing and testing my code, so every time I run it I get a different matrix (since `list(set())` is unordered). I've constructed a sparse matrix using `scipy.sparse.coo_matrix()` and would like to be able to use the coordinates and values generated by that type of construction. I imagine that this would be the fastest and most memory-efficient way of doing it. At the moment, when I try to access those values, I am presented with

``````[<1x16 sparse matrix of type '<class 'numpy.float32'>'
with 10 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
with 4 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
with 4 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
with 7 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
``````

When I `print` the sparse matrix I get the following:

``````  (0, 1)    0.5
(0, 4)    1.0
(0, 6)    0.5
(1, 7)    1.0
(1, 11)   1.0
(1, 12)   1.0
(1, 13)   0.5
(2, 14)   0.5
...
(15, 6)   1.0
(15, 9)   0.5
(15, 15)  3.0
(15, 0)   2.0
(15, 1)   0.5
(15, 6)   0.5
(15, 14)  1.5
``````

I would imagine that retrieving those values as they appear is possible.

For the above example I extract the following instance:

``````import numpy as np
from scipy import sparse

row = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4,
4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8,
9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13,
13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15]

column = [1, 4, 6, 7, 11, 12, 13, 14, 15, 0, 4, 9, 12, 13, 14, 15, 4, 5, 12, 13,
4, 9, 13, 14, 0, 1, 2, 3, 5, 8, 10, 12, 13, 14, 2, 4, 12, 13, 0, 14,
15, 0, 8, 11, 13, 4, 7, 10, 11, 1, 3, 12, 14, 4, 8, 11, 13, 0, 7, 8,
10, 0, 1, 2, 4, 5, 9, 13, 0, 1, 2, 3, 4, 5, 7, 10, 12, 0, 1, 3, 4, 6,
9, 15, 0, 1, 6, 14]

values = [0.5, 1.0, 0.5, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 0.5,
1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 0.5, 1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5,
0.5, 1.0, 0.5, 0.5, 1.0, 1.0, 1.5, 2.0, 1.0, 2.5, 1.0, 3.0, 1.0, 0.5,
1.5, 2.0, 1.0, 1.0, 2.0, 0.5, 1.0, 0.5, 2.0, 2.0, 0.5, 4.0, 0.5, 0.5,
0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 0.5, 1.0, 1.0, 0.5, 0.5, 0.5, 2.5, 1.0,
4.0, 1.0, 1.0, 1.5, 1.0, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 1.0, 0.5, 3.0,
2.0, 0.5, 0.5, 1.5]

sps_array = sparse.coo_matrix((values, (row, column)), shape=(16, 16))
``````

At the moment I'm able to transform `sps_array` using `sps_array.toarray()` and then create lists where

``````list1 = list(np.nonzero(sps_array > 0)[0])
list2 = list(np.nonzero(sps_array > 0)[1])
``````

and create the following `for` loop to reconstruct the coordinates:

``````index = 0
sps_coordinates = []

for i in range(token_size):
    for j in range(list1_count[i]):
        sps_coordinates.append((list1[index+j], list2[index+j]))
    index += list1_count[i]
``````

I retrieve the values by

``````list(sps_array[sps_array > 0])
``````

Is there a more efficient way to get those coordinates and values relative to what I have done?

With a copy-n-paste I construct your `sps_array`:

``````In [2126]: sps_array
Out[2126]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 88 stored elements in COOrdinate format>
``````

A `coo` format stores its values in 3 attributes, each an array (derived from the 3 input lists):

``````In [2127]: sps_array.data
Out[2127]:
array([ 0.5,  1. ,  0.5,  1. ,  1. ,  1. ,  0.5,  0.5,  1. ,  1. ,  0.5,
0.5,  1. ,  0.5,  1. ,  0.5,  1. ,  0.5,  1. ,  0.5,  0.5,  1. ,
0.5,  1. ,  1. ,  1. ,  1. ,  0.5,  0.5,  1. ,  0.5,  0.5,  1. ,
1. ,  1.5,  2. ,  1. ,  2.5,  1. ,  3. ,  1. ,  0.5,  1.5,  2. ,
1. ,  1. ,  2. ,  0.5,  1. ,  0.5,  2. ,  2. ,  0.5,  4. ,  0.5,
0.5,  0.5,  1. ,  1. ,  0.5,  0.5,  1. ,  0.5,  1. ,  1. ,  0.5,
0.5,  0.5,  2.5,  1. ,  4. ,  1. ,  1. ,  1.5,  1. ,  1. ,  1. ,
0.5,  1. ,  0.5,  1. ,  1. ,  0.5,  3. ,  2. ,  0.5,  0.5,  1.5])
In [2128]: sps_array.row
Out[2128]:
array([ 0,  0,  0,  1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,
3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  5,  5,
6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8,  9,  9,  9,  9, 10,
10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13,
13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15], dtype=int32)
In [2129]: sps_array.col
Out[2129]:
array([ 1,  4,  6,  7, 11, 12, 13, 14, 15,  0,  4,  9, 12, 13, 14, 15,  4,
5, 12, 13,  4,  9, 13, 14,  0,  1,  2,  3,  5,  8, 10, 12, 13, 14,
2,  4, 12, 13,  0, 14, 15,  0,  8, 11, 13,  4,  7, 10, 11,  1,  3,
12, 14,  4,  8, 11, 13,  0,  7,  8, 10,  0,  1,  2,  4,  5,  9, 13,
0,  1,  2,  3,  4,  5,  7, 10, 12,  0,  1,  3,  4,  6,  9, 15,  0,
1,  6, 14], dtype=int32)
``````
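Since the three attributes are parallel arrays, you can pair them up directly. Here is a minimal sketch on a cut-down matrix; the names `small` and `triples` are just for illustration, and only the first four entries of your data are used:

```python
from scipy import sparse

# Small stand-in for the 16x16 matrix above (first four entries only).
row = [0, 0, 0, 1]
col = [1, 4, 6, 7]
values = [0.5, 1.0, 0.5, 1.0]
small = sparse.coo_matrix((values, (row, col)), shape=(16, 16))

# .row, .col and .data line up element by element, so zipping them gives
# one (row, column, value) triple per stored element, in stored order.
triples = list(zip(small.row.tolist(), small.col.tolist(), small.data.tolist()))
print(triples)
```

No dense array is built at any point, so this should scale with the number of stored elements rather than with the full matrix shape.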

A sparse matrix has a `nonzero` method, whose code is:

``````    A = self.tocoo()
``````

It makes sure the matrix is in `coo` format, makes sure there aren't any 'hidden' zeros in the data, and returns the `row` and `col` attributes.

This isn't needed if your matrix is already `coo`, but is needed if the matrix is in `csr` format.
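The steps described above can be sketched as a small function. This is my own illustration of the logic, not a copy of the library source, and `nonzero_coords` is a hypothetical name:

```python
import numpy as np
from scipy import sparse

def nonzero_coords(matrix):
    """Hypothetical helper following the three steps described above."""
    coo = matrix.tocoo()                  # 1. make sure the format is COO
    keep = coo.data != 0                  # 2. mask out explicitly stored zeros
    return coo.row[keep], coo.col[keep]   # 3. coordinates of what remains

# A matrix with one explicitly stored ("hidden") zero at (1, 2).
m = sparse.coo_matrix(([0.5, 0.0, 2.0], ([0, 1, 2], [1, 2, 0])), shape=(3, 3))
rows, cols = nonzero_coords(m)
print(m.nnz, rows, cols)
```

Note that `m.nnz` counts the stored zero as well, which is why the masking step matters if such zeros can appear in your data.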

So you don't need to go through the dense `toarray` and the `np.nonzero` function. However, `np.nonzero(sps_array)` does work, because it delegates the task to `sps_array.nonzero()`.
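Your first output mentions Compressed Sparse Row format, so at some point you may be holding a `csr` matrix rather than a `coo` one. A rough sketch of getting coordinates and values from that without a dense array, using a made-up 2x3 matrix:

```python
from scipy import sparse

# Illustrative CSR matrix; the values are invented for this example.
csr = sparse.csr_matrix([[0.0, 0.5, 0.0],
                         [1.0, 0.0, 2.0]])

# CSR stores .data, .indices and .indptr rather than .row/.col,
# so convert to COO first and read the coordinates from there.
coo = csr.tocoo()
coords = list(zip(coo.row.tolist(), coo.col.tolist()))
vals = coo.data.tolist()
print(coords, vals)
```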

Applying `transpose` to the `nonzero` gives an array that may be what you want:

``````In [2136]: np.transpose(np.nonzero(sps_array))
Out[2136]:
array([[ 0,  1],
[ 0,  4],
[ 0,  6],
[ 1,  7],
[ 1, 11],
[ 1, 12],
....
``````

In fact there is a `np` function that does just this (for any array); look at its code or docs:

``````np.argwhere(sps_array)
``````
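If the matrix is already `coo`, the same N x 2 array can also be built straight from the attributes. A small sketch, assuming there are no explicitly stored zeros (otherwise the two results below would differ); `pairs` and `same` are illustrative names:

```python
import numpy as np
from scipy import sparse

m = sparse.coo_matrix(([0.5, 1.0, 0.5], ([0, 0, 1], [1, 4, 7])), shape=(16, 16))

# Stack the row and column index arrays side by side: one (row, col) per line.
pairs = np.column_stack((m.row, m.col))

# Equivalent route through the sparse matrix's own nonzero method.
same = np.transpose(m.nonzero())
print(pairs)
```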

(You don't need to use `nonzero(sps_array>0)` unless you are worried about negative values.)
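If negative values are a concern, one option is to filter on the `data` attribute instead of comparing the whole matrix. A minimal sketch with invented values, including one negative entry:

```python
from scipy import sparse

m = sparse.coo_matrix(([0.5, -1.0, 2.0], ([0, 1, 2], [1, 2, 0])), shape=(3, 3))

# Boolean mask over the stored values only; no dense array is created.
positive = m.data > 0
pos_coords = list(zip(m.row[positive].tolist(), m.col[positive].tolist()))
pos_values = m.data[positive].tolist()
print(pos_coords, pos_values)
```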
