AMagee - 1 year ago 71
Python Question

# Grouping continguous values in a numpy array with their length

In numpy / scipy (or pure python if you prefer), what would be a good way to group contiguous regions in a numpy array and count the length of these regions?

Something like this:

``````x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
y = contiguousGroup(x)
print y

>> [[1,3], [2,2], [3,1], [0,5], [1,1], [2,1], [3,1], [1,2], [0,3]]
``````

I have tried to do this just with loops however it takes a longer time than I would like (6 seconds) to do a list with about 30 million samples and 20000 contiguous regions.

Edit:

And now for some speed comparisons (just using time.clock() and a few hundred iterations).

Firstly my python loop code tested on 5 samples.

``````Number of elements  33718251
Number of regions  135137
Time taken = 8.644007 seconds...

Number of elements  42503100
Number of regions  6985
Time taken = 10.533305 seconds...

Number of elements  21841302
Number of regions  7619335
Time taken = 7.671015 seconds...

Number of elements  19723928
Number of regions  10799
Time taken = 5.014807 seconds...

Number of elements  16619539
Number of regions  19293
Time taken = 4.207359 seconds...
``````

And now with Divakar's vectorized solution.

``````Number of elements  33718251
Number of regions  135137
Time taken = 0.063470 seconds...

Number of elements  42503100
Number of regions  6985
Time taken = 0.046293 seconds...

Number of elements  21841302
Number of regions  7619335
Time taken = 1.654288 seconds...

Number of elements  19723928
Number of regions  10799
Time taken = 0.022651 seconds...

Number of elements  16619539
Number of regions  19293
Time taken = 0.021189 seconds...
``````

Here's a vectorized approach -

``````idx = np.concatenate(([0],np.flatnonzero(x[:-1]!=x[1:])+1,[x.size]))
out = zip(x[idx[:-1]],np.diff(idx))
``````

Sample run -

``````In [34]: x
Out[34]: array([1, 1, 1, 2, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0])

In [35]: out
Out[35]: [(1, 3), (2, 2), (3, 1), (0, 5), (1, 1), (2, 1), (3, 1), (1, 2), (0, 3)]
``````

The concatenation on the entire array could be expensive. So, a modified version that does concatenation on the group shifting indices rather could be suggested, like so -

``````idx0 = np.flatnonzero(x[:-1]!=x[1:])
count = np.concatenate(([idx0[0]+1],np.diff(idx0),[x.size-idx0[-1]-1]))
out = zip(x[np.append(0,idx0+1)],count)
``````

Alternatively, at the final step, if the output as a `2D` array is okay, we could avoid that `zipping` and use NumPy's column_stack, like so -

``````out = np.column_stack((x[np.append(0,idx0+1)],count))
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download