Stormvirux - 1 month ago 22

Python Question

I am working on a sparse matrix stored in COO format. What would be the fastest way to get the number of consecutive elements per each row.

For example consider the following matrix:

`a = [[0,1,2,0],[1,0,0,2],[0,0,0,0],[1,0,1,0]]`

Its COO representation would be

`(0, 1) 1`

(0, 2) 2

(1, 0) 1

(1, 3) 2

(3, 0) 1

(3, 2) 1

I need the result to be

`[1,2,0,2]`

Another simple example: consider the following row:

`[1,2,3,0,0,0,2,0,0,8,7,6,0,0]`

The above row should return

`[3]`

Answer

Convert it to a dense array, and apply your logic row by row.

- you want the number of groups per row
- zeros count when defining groups
- row iteration is faster with arrays

In `coo`

format your matrix looks like:

```
In [623]: M=sparse.coo_matrix(a)
In [624]: M.data
Out[624]: array([1, 2, 1, 2, 1, 1])
In [625]: M.row
Out[625]: array([0, 0, 1, 1, 3, 3], dtype=int32)
In [626]: M.col
Out[626]: array([1, 2, 0, 3, 0, 2], dtype=int32)
```

This format does not implement row indexing; `csr`

and `lil`

do

```
In [627]: M.tolil().data
Out[627]: array([[1, 2], [1, 2], [], [1, 1]], dtype=object)
In [628]: M.tolil().rows
Out[628]: array([[1, 2], [0, 3], [], [0, 2]], dtype=object)
```

So the sparse information for the 1st row is a list of nonzero data values, `[1,2]`

, and list of their column numbers, `[1,2]`

. Compare that with the row of the dense array, `[0, 1, 2, 0]`

. Which is easier to analyze?

Your first task is to write a function that analyzes one row. I haven't studied your logic enough to say whether the dense form is better than the sparse one or not. It is easy to get the column information from the dense form with `M.A[0,:].nonzero()`

.

In your last example, I can get the nonzero indices:

```
In [631]: np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])
Out[631]: (array([ 0, 1, 2, 6, 9, 10, 11], dtype=int32),)
In [632]: idx=np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])[0]
In [633]: idx
Out[633]: array([ 0, 1, 2, 6, 9, 10, 11], dtype=int32)
In [634]: np.diff(idx)
Out[634]: array([1, 1, 4, 3, 1, 1], dtype=int32)
```

We may be able to get the desired count from the number of `diff`

values `>1`

, though I'd have to look at more examples to define the details.

Extension of the analysis to multiple rows depends on first thoroughly understanding the single row case.

Source (Stackoverflow)

Comments