Stormvirux Stormvirux - 1 month ago 22
Python Question

Consecutive elements in a Sparse matrix row

I am working on a sparse matrix stored in COO format. What would be the fastest way to get the number of consecutive elements per each row.

For example consider the following matrix:

a = [[0,1,2,0],[1,0,0,2],[0,0,0,0],[1,0,1,0]]


Its COO representation would be

(0, 1) 1
(0, 2) 2
(1, 0) 1
(1, 3) 2
(3, 0) 1
(3, 2) 1


I need the result to be
[1,2,0,2]
. The first row contains two Non-zero elements that lies nearby. Hence its a group or set. In the second row we have two non-zero elements,but they dont lie nearby, and hence we can say that it forms two groups. The third row there are no non-zeroes and hence no groups. The fourth row has again two non-zeroes but separated by zeroes nad hence we consider as two groups. It would be like the number of clusters per row. Iterating through the rows are an option but only if there is no faster solution. Any help in this regard is appreciated.

Another simple example: consider the following row:

[1,2,3,0,0,0,2,0,0,8,7,6,0,0]


The above row should return
[3]
sine there are three groups of non-zeroes getting separated by zeroes.

Answer

Convert it to a dense array, and apply your logic row by row.

  • you want the number of groups per row
  • zeros count when defining groups
  • row iteration is faster with arrays

In coo format your matrix looks like:

In [623]: M=sparse.coo_matrix(a)
In [624]: M.data
Out[624]: array([1, 2, 1, 2, 1, 1])
In [625]: M.row
Out[625]: array([0, 0, 1, 1, 3, 3], dtype=int32)
In [626]: M.col
Out[626]: array([1, 2, 0, 3, 0, 2], dtype=int32)

This format does not implement row indexing; csr and lil do

In [627]: M.tolil().data
Out[627]: array([[1, 2], [1, 2], [], [1, 1]], dtype=object)
In [628]: M.tolil().rows
Out[628]: array([[1, 2], [0, 3], [], [0, 2]], dtype=object)

So the sparse information for the 1st row is a list of nonzero data values, [1,2], and list of their column numbers, [1,2]. Compare that with the row of the dense array, [0, 1, 2, 0]. Which is easier to analyze?

Your first task is to write a function that analyzes one row. I haven't studied your logic enough to say whether the dense form is better than the sparse one or not. It is easy to get the column information from the dense form with M.A[0,:].nonzero().

In your last example, I can get the nonzero indices:

In [631]: np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])
Out[631]: (array([ 0,  1,  2,  6,  9, 10, 11], dtype=int32),)
In [632]: idx=np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])[0]
In [633]: idx
Out[633]: array([ 0,  1,  2,  6,  9, 10, 11], dtype=int32)
In [634]: np.diff(idx)
Out[634]: array([1, 1, 4, 3, 1, 1], dtype=int32)

We may be able to get the desired count from the number of diff values >1, though I'd have to look at more examples to define the details.

Extension of the analysis to multiple rows depends on first thoroughly understanding the single row case.

Comments