snowmonkey - 1 year ago 71
Python Question

# Build numpy array with multiple custom index ranges without explicit loop

In Numpy, is there a pythonic way to create array3 with custom ranges from array1 and array2 without a loop? The straightforward solution of iterating over the ranges works but since my arrays run into millions of items, I am looking for a more efficient solution (maybe syntactic sugar too).

For ex.,

``````array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
array3 = np.concatenate([np.arange(array1[i], array2[i]) for i in
np.arange(0,len(array1))])

print array3
``````

result:
`[10,11,12,13,65,66,67,68,69,200,201,202,203]`
.

## Prospective Approach

I will go backwards on how to approach this problem.

Take the sample listed in the question. We have -

``````array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
``````

Now, look at the desired result -

``````result: [10,11,12,13,65,66,67,68,69,200,201,202,203]
``````

Let's calculate the group lengths, as we would be needing those to explain the solution approach next.

``````In [58]: lens = array2 - array1

In [59]: lens
Out[59]: array([4, 5, 4])
``````

The idea is to use `1`'s initialized array, which when cumumlative summed across the entire length would give us the desired result. This cumumlative summation would be the last step to our solution. Why `1`'s initialized? Well, because we have an array that increasing in steps of `1`'s except at specific places where we have shifts corresponding to new groups coming in.

Now, since `cumsum` would be the last step, so the step before it should give us something like -

``````array([ 10,   1,   1,   1,  52,   1,   1,   1,   1, 131,   1,   1,   1])
``````

As discussed before, it's `1`'s filled with `[10,52,131]` at specific places. That `10` seems to be coming in from the first element in `array1`, but what about the rest? The second one `52` came in as `65-13` (looking at the `result`) and in it `13` came in the group that started with `10` and ran because of the length of the first group `4`. So, if we do `65 - 10 - 4`, we will get `51` and then add `1` to accomodate for boundary stop, we would have `52`, which is the desired shifting value. Similarly, we would get `131`.

Thus, those `shifting-values` could be computed, like so -

``````In [62]: np.diff(array1) - lens[:-1]+1
Out[62]: array([ 52, 131])
``````

Next up, to get those `shifting-places` where such shifts occur, we can simply do cumulative summation on the group lengths -

``````In [65]: lens[:-1].cumsum()
Out[65]: array([4, 9])
``````

For completeness, we need to pre-append `0` with the array of `shifting-places` and `array1[0]` for `shifting-values`.

So, we are set to present our approach in a step-by-step format!

## Putting back the pieces

1] Get lengths of each group :

``````lens = array2 - array1
``````

2] Get indices at which shifts occur and values to be put in `1`'s initialized array :

``````shift_idx = np.hstack((0,lens[:-1].cumsum()))
shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
``````

3] Setup `1`'s initialized ID array for inserting those values at those indices listed in the step before :

``````id_arr = np.ones(lens.sum(),dtype=array1.dtype)
id_arr[shift_idx] = shift_vals
``````

4] Finally do cumulative summation on the ID array :

``````output = id_arr.cumsum()
``````

Listed in a function format, we would have -

``````def using_ones_cumsum(array1, array2):
lens = array2 - array1
shift_idx = np.hstack((0,lens[:-1].cumsum()))
shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
id_arr = np.ones(lens.sum(),dtype=array1.dtype)
id_arr[shift_idx] = shift_vals
return id_arr.cumsum()
``````

And it works on overlapping ranges too!

``````In [67]: array1 = np.array([10, 11, 200])
...: array2 = np.array([14, 18, 204])
...:

In [68]: using_ones_cumsum(array1, array2)
Out[68]:
array([ 10,  11,  12,  13,  11,  12,  13,  14,  15,  16,  17, 200, 201,
202, 203])
``````

Runtime test

Let's time the proposed approach against the other vectorized approach in `@unutbu's flatnonzero based solution`, which already proved to be much better than the loopy approach -

``````In [38]: array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True)
...:                   .cumsum().reshape(2, -1, order='F'))

In [39]: %timeit using_flatnonzero(array1, array2)
1000 loops, best of 3: 889 µs per loop

In [40]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 235 µs per loop
``````

## Improvement!

Now, codewise NumPy doesn't like appending. So, those `np.hstack` calls could be avoided for a slightly improved version as listed below -

``````def get_ranges_arr(starts,ends):
counts = ends - starts
counts_csum = counts.cumsum()
id_arr = np.ones(counts_csum[-1],dtype=int)
id_arr[0] = starts[0]
id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
return id_arr.cumsum()
``````

Let's time it against our original approach -

``````In [151]: array1,array2 = (np.random.choice(range(1, 11),size=10**4, replace=True)\
...:                                      .cumsum().reshape(2, -1, order='F'))

In [152]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 276 µs per loop

In [153]: %timeit get_ranges_arr(array1, array2)
10000 loops, best of 3: 193 µs per loop
``````

So, we have a `30%` performance boost there!

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download