Lukasz - 9 months ago 62

Python Question

I'm working with a series of text corpora and, in doing so, I need to construct a co-occurrence matrix. I'm currently writing and testing my code, so every time I run it I get a different matrix (since

`list(set())`

`scipy.sparse.coo_matrix()`

```
[<1x16 sparse matrix of type '<class 'numpy.float32'>'
    with 10 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
    with 4 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
    with 4 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
    with 7 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
```

When I

`print`

```
(0, 1) 0.5
(0, 4) 1.0
(0, 6) 0.5
(1, 7) 1.0
(1, 11) 1.0
(1, 12) 1.0
(1, 13) 0.5
(2, 14) 0.5
...
(15, 6) 1.0
(15, 9) 0.5
(15, 15) 3.0
(15, 0) 2.0
(15, 1) 0.5
(15, 6) 0.5
(15, 14) 1.5
```

I would imagine that retrieving those values as they appear is possible.

For the above example I extract the following instance:

```
from scipy import sparse

# Row index of each stored entry
row = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4,
       4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8,
       9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13,
       13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15,
       15, 15, 15, 15, 15, 15, 15]

# Column index of each stored entry
column = [1, 4, 6, 7, 11, 12, 13, 14, 15, 0, 4, 9, 12, 13, 14, 15, 4, 5, 12, 13,
          4, 9, 13, 14, 0, 1, 2, 3, 5, 8, 10, 12, 13, 14, 2, 4, 12, 13, 0, 14,
          15, 0, 8, 11, 13, 4, 7, 10, 11, 1, 3, 12, 14, 4, 8, 11, 13, 0, 7, 8,
          10, 0, 1, 2, 4, 5, 9, 13, 0, 1, 2, 3, 4, 5, 7, 10, 12, 0, 1, 3, 4, 6,
          9, 15, 0, 1, 6, 14]

# Value of each stored entry
values = [0.5, 1.0, 0.5, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 0.5,
          1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 0.5, 1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5,
          0.5, 1.0, 0.5, 0.5, 1.0, 1.0, 1.5, 2.0, 1.0, 2.5, 1.0, 3.0, 1.0, 0.5,
          1.5, 2.0, 1.0, 1.0, 2.0, 0.5, 1.0, 0.5, 2.0, 2.0, 0.5, 4.0, 0.5, 0.5,
          0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 0.5, 1.0, 1.0, 0.5, 0.5, 0.5, 2.5, 1.0,
          4.0, 1.0, 1.0, 1.5, 1.0, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 1.0, 0.5, 3.0,
          2.0, 0.5, 0.5, 1.5]

sps_array = sparse.coo_matrix((values, (row, column)), shape=(16, 16))
```

At the moment I'm able to transform `sps_array` into a dense array with `sps_array.toarray` and then find the nonzero coordinates:

```
list1 = list(np.nonzero(sps_array > 0)[0])
list2 = list(np.nonzero(sps_array > 0)[1])
```

and creating the following `for` loop:

```
index = 0
sps_coordinates = []
for i in range(token_size):
    for j in range(list1_count[i]):
        sps_coordinates.append((list1[index+j], list2[index+j]))
    index += list1_count[i]
```

I retrieve the values by `list(sps_array[sps_array > 0])`.

Is there a more efficient way to get those coordinates and values relative to what I have done?

Answer Source

With a copy-n-paste I construct your `sps_array`:

```
In [2126]: sps_array
Out[2126]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 88 stored elements in COOrdinate format>
```

A `coo` format matrix stores its values in 3 attributes, each an array (derived from the 3 input lists):

```
In [2127]: sps_array.data
Out[2127]:
array([ 0.5, 1. , 0.5, 1. , 1. , 1. , 0.5, 0.5, 1. , 1. , 0.5,
0.5, 1. , 0.5, 1. , 0.5, 1. , 0.5, 1. , 0.5, 0.5, 1. ,
0.5, 1. , 1. , 1. , 1. , 0.5, 0.5, 1. , 0.5, 0.5, 1. ,
1. , 1.5, 2. , 1. , 2.5, 1. , 3. , 1. , 0.5, 1.5, 2. ,
1. , 1. , 2. , 0.5, 1. , 0.5, 2. , 2. , 0.5, 4. , 0.5,
0.5, 0.5, 1. , 1. , 0.5, 0.5, 1. , 0.5, 1. , 1. , 0.5,
0.5, 0.5, 2.5, 1. , 4. , 1. , 1. , 1.5, 1. , 1. , 1. ,
0.5, 1. , 0.5, 1. , 1. , 0.5, 3. , 2. , 0.5, 0.5, 1.5])
In [2128]: sps_array.row
Out[2128]:
array([ 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10,
10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13,
13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15], dtype=int32)
In [2129]: sps_array.col
Out[2129]:
array([ 1, 4, 6, 7, 11, 12, 13, 14, 15, 0, 4, 9, 12, 13, 14, 15, 4,
5, 12, 13, 4, 9, 13, 14, 0, 1, 2, 3, 5, 8, 10, 12, 13, 14,
2, 4, 12, 13, 0, 14, 15, 0, 8, 11, 13, 4, 7, 10, 11, 1, 3,
12, 14, 4, 8, 11, 13, 0, 7, 8, 10, 0, 1, 2, 4, 5, 9, 13,
0, 1, 2, 3, 4, 5, 7, 10, 12, 0, 1, 3, 4, 6, 9, 15, 0,
1, 6, 14], dtype=int32)
```
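As a minimal sketch of what this means for the question, the coordinates and values can be read straight from those three attributes. The small 3x3 matrix `m` below is hypothetical illustration data, not the matrix from the question.

```python
from scipy import sparse

# Small illustrative matrix (hypothetical data, not the one from the question).
rows = [0, 0, 2]
cols = [1, 2, 0]
vals = [0.5, 1.0, 2.0]
m = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3))

# Coordinates and values come straight from the COO attributes,
# in the order they are stored; no dense conversion is involved.
coordinates = list(zip(m.row.tolist(), m.col.tolist()))
values = m.data.tolist()

print(coordinates)
print(values)
```

Entry `k` of `coordinates` pairs with entry `k` of `values`, so no counting loop is needed to line them up.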

A sparse matrix has a `nonzero` method, whose code is:

```
A = self.tocoo()
nz_mask = A.data != 0
return (A.row[nz_mask],A.col[nz_mask])
```

It makes sure the matrix is in `coo` format, makes sure there aren't any 'hidden' zeros in the data, and returns the `row` and `col` attributes.

This isn't needed if your matrix is already `coo`, but is needed if the matrix is in `csr` format.
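A small sketch of both points, using a hypothetical 3x3 matrix with one explicitly stored zero: `nonzero()` filters that entry out, and a `csr` matrix is converted with `tocoo()` before its coordinates are read.

```python
from scipy import sparse

# Hypothetical 3x3 matrix with one explicitly stored zero at (1, 1).
m = sparse.coo_matrix(([1.0, 0.0, 2.0], ([0, 1, 2], [1, 1, 0])), shape=(3, 3))

print(m.nnz)        # counts stored entries, including the explicit zero
r, c = m.nonzero()  # the stored zero is filtered out here

# CSR stores indices/indptr rather than row/col arrays,
# so convert to COO before reading coordinates.
csr = m.tocsr()
coo = csr.tocoo()
mask = coo.data != 0  # same filter that nonzero() applies
rows, cols, vals = coo.row[mask], coo.col[mask], coo.data[mask]

print(list(zip(rows.tolist(), cols.tolist())), vals.tolist())
```

Applying the same mask to `data` keeps the values aligned with the filtered coordinates.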

So you don't need to go through the dense `toarray` and the `np.nonzero` function. However, `np.nonzero(sps_array)` does work, because it delegates the task to `sps_array.nonzero()`.

Applying `transpose` to the `nonzero` result gives an array that may be what you want:

```
In [2136]: np.transpose(np.nonzero(sps_array))
Out[2136]:
array([[ 0, 1],
[ 0, 4],
[ 0, 6],
[ 1, 7],
[ 1, 11],
[ 1, 12],
....
```

In fact there is a NumPy function that does just this for any array (look at its code or docs):

```
np.argwhere(sps_array)
```
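The equivalence between the two forms can be checked on a small case. This is only a sketch on a hypothetical 3x3 matrix; it compares `np.argwhere` against the transposed `nonzero` result on the dense array, and then uses the sparse matrix's own `nonzero` method to skip the dense step.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x3 example.
m = sparse.coo_matrix(([0.5, 1.0, 2.0], ([0, 0, 2], [1, 2, 0])), shape=(3, 3))
dense = m.toarray()

# For an ordinary ndarray, argwhere and transpose(nonzero) agree.
pairs_dense = np.argwhere(dense)
same = np.array_equal(pairs_dense, np.transpose(np.nonzero(dense)))

# For the sparse matrix, its own nonzero method avoids the dense detour.
pairs_sparse = np.transpose(m.nonzero())

print(same)
print(pairs_sparse)
```

Each row of `pairs_sparse` is one `(row, col)` coordinate, which matches the layout shown in the transposed output above.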

(You don't need to use `nonzero(sps_array>0)` unless you are worried about negative values.)
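To illustrate when the `> 0` test matters, here is a sketch on a hypothetical matrix that holds one negative value. It filters the `coo` attributes directly rather than comparing the whole matrix; that is a swapped-in approach for illustration, not the code from the question.

```python
from scipy import sparse

# Hypothetical 3x3 matrix with a negative entry at (1, 1).
m = sparse.coo_matrix(([0.5, -1.0, 2.0], ([0, 1, 2], [1, 1, 0])), shape=(3, 3))

nonzero_mask = m.data != 0   # keeps negative entries
positive_mask = m.data > 0   # drops negative entries

all_coords = list(zip(m.row[nonzero_mask].tolist(), m.col[nonzero_mask].tolist()))
positive_coords = list(zip(m.row[positive_mask].tolist(), m.col[positive_mask].tolist()))
positive_values = m.data[positive_mask].tolist()

print(all_coords)
print(positive_coords, positive_values)
```

For a co-occurrence matrix whose counts are never negative, the two masks select the same entries.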