AMagee AMagee - 2 months ago 8
Python Question

Grouping continguous values in a numpy array with their length

In numpy / scipy (or pure python if you prefer), what would be a good way to group contiguous regions in a numpy array and count the length of these regions?

Something like this:

x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
y = contiguousGroup(x)
print y

>> [[1,3], [2,2], [3,1], [0,5], [1,1], [2,1], [3,1], [1,2], [0,3]]


I have tried to do this just with loops however it takes a longer time than I would like (6 seconds) to do a list with about 30 million samples and 20000 contiguous regions.

Edit:

And now for some speed comparisons (just using time.clock() and a few hundred iterations).

Firstly my python loop code tested on 5 samples.

Number of elements 33718251
Number of regions 135137
Time taken = 8.644007 seconds...

Number of elements 42503100
Number of regions 6985
Time taken = 10.533305 seconds...

Number of elements 21841302
Number of regions 7619335
Time taken = 7.671015 seconds...

Number of elements 19723928
Number of regions 10799
Time taken = 5.014807 seconds...

Number of elements 16619539
Number of regions 19293
Time taken = 4.207359 seconds...


And now with Divakar's vectorized solution.

Number of elements 33718251
Number of regions 135137
Time taken = 0.063470 seconds...

Number of elements 42503100
Number of regions 6985
Time taken = 0.046293 seconds...

Number of elements 21841302
Number of regions 7619335
Time taken = 1.654288 seconds...

Number of elements 19723928
Number of regions 10799
Time taken = 0.022651 seconds...

Number of elements 16619539
Number of regions 19293
Time taken = 0.021189 seconds...

Answer

Here's a vectorized approach -

idx = np.concatenate(([0],np.flatnonzero(x[:-1]!=x[1:])+1,[x.size]))
out = zip(x[idx[:-1]],np.diff(idx))

Sample run -

In [34]: x
Out[34]: array([1, 1, 1, 2, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0])

In [35]: out
Out[35]: [(1, 3), (2, 2), (3, 1), (0, 5), (1, 1), (2, 1), (3, 1), (1, 2), (0, 3)]

The concatenation on the entire array could be expensive. So, a modified version that does concatenation on the group shifting indices rather could be suggested, like so -

idx0 = np.flatnonzero(x[:-1]!=x[1:])
count = np.concatenate(([idx0[0]+1],np.diff(idx0),[x.size-idx0[-1]-1]))
out = zip(x[np.append(0,idx0+1)],count)

Alternatively, at the final step, if the output as a 2D array is okay, we could avoid that zipping and use NumPy's column_stack, like so -

out = np.column_stack((x[np.append(0,idx0+1)],count))
Comments