AMagee - 1 year ago 48

Python Question

In numpy / scipy (or pure python if you prefer), what would be a good way to group contiguous regions in a numpy array and count the length of these regions?

Something like this:

`x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])`

y = contiguousGroup(x)

print y

>> [[1,3], [2,2], [3,1], [0,5], [1,1], [2,1], [3,1], [1,2], [0,3]]

I have tried to do this just with loops however it takes a longer time than I would like (6 seconds) to do a list with about 30 million samples and 20000 contiguous regions.

Edit:

And now for some speed comparisons (just using time.clock() and a few hundred iterations).

Firstly my python loop code tested on 5 samples.

`Number of elements 33718251`

Number of regions 135137

Time taken = 8.644007 seconds...

Number of elements 42503100

Number of regions 6985

Time taken = 10.533305 seconds...

Number of elements 21841302

Number of regions 7619335

Time taken = 7.671015 seconds...

Number of elements 19723928

Number of regions 10799

Time taken = 5.014807 seconds...

Number of elements 16619539

Number of regions 19293

Time taken = 4.207359 seconds...

And now with Divakar's vectorized solution.

`Number of elements 33718251`

Number of regions 135137

Time taken = 0.063470 seconds...

Number of elements 42503100

Number of regions 6985

Time taken = 0.046293 seconds...

Number of elements 21841302

Number of regions 7619335

Time taken = 1.654288 seconds...

Number of elements 19723928

Number of regions 10799

Time taken = 0.022651 seconds...

Number of elements 16619539

Number of regions 19293

Time taken = 0.021189 seconds...

Answer Source

Here's a vectorized approach -

```
idx = np.concatenate(([0],np.flatnonzero(x[:-1]!=x[1:])+1,[x.size]))
out = zip(x[idx[:-1]],np.diff(idx))
```

Sample run -

```
In [34]: x
Out[34]: array([1, 1, 1, 2, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0])
In [35]: out
Out[35]: [(1, 3), (2, 2), (3, 1), (0, 5), (1, 1), (2, 1), (3, 1), (1, 2), (0, 3)]
```

The concatenation on the entire array could be expensive. So, a modified version that does concatenation on the group shifting indices rather could be suggested, like so -

```
idx0 = np.flatnonzero(x[:-1]!=x[1:])
count = np.concatenate(([idx0[0]+1],np.diff(idx0),[x.size-idx0[-1]-1]))
out = zip(x[np.append(0,idx0+1)],count)
```

Alternatively, at the final step, if the output as a `2D`

array is okay, we could avoid that `zipping`

and use NumPy's column_stack, like so -

```
out = np.column_stack((x[np.append(0,idx0+1)],count))
```