snowmonkey - 1 year ago 62

Python Question

In Numpy, is there a pythonic way to create array3 with custom ranges from array1 and array2 without a loop? The straightforward solution of iterating over the ranges works but since my arrays run into millions of items, I am looking for a more efficient solution (maybe syntactic sugar too).

For ex.,

`array1 = np.array([10, 65, 200])`

array2 = np.array([14, 70, 204])

array3 = np.concatenate([np.arange(array1[i], array2[i]) for i in

np.arange(0,len(array1))])

print array3

result:

`[10,11,12,13,65,66,67,68,69,200,201,202,203]`

Answer Source

I will go backwards on how to approach this problem.

Take the sample listed in the question. We have -

```
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
```

Now, look at the desired result -

```
result: [10,11,12,13,65,66,67,68,69,200,201,202,203]
```

Let's calculate the group lengths, as we would be needing those to explain the solution approach next.

```
In [58]: lens = array2 - array1
In [59]: lens
Out[59]: array([4, 5, 4])
```

The idea is to use `1`

's initialized array, which when cumumlative summed across the entire length would give us the desired result.
This cumumlative summation would be the last step to our solution.
Why `1`

's initialized? Well, because we have an array that increasing in steps of `1`

's except at specific places where we have shifts
corresponding to new groups coming in.

Now, since `cumsum`

would be the last step, so the step before it should give us something like -

```
array([ 10, 1, 1, 1, 52, 1, 1, 1, 1, 131, 1, 1, 1])
```

As discussed before, it's `1`

's filled with `[10,52,131]`

at specific places. That `10`

seems to be coming in from the first element in `array1`

, but what about the rest?
The second one `52`

came in as `65-13`

(looking at the `result`

) and in it `13`

came in the group that started with `10`

and ran because of the length of
the first group `4`

. So, if we do `65 - 10 - 4`

, we will get `51`

and then add `1`

to accomodate for boundary stop, we would have `52`

, which is the
desired shifting value. Similarly, we would get `131`

.

Thus, those `shifting-values`

could be computed, like so -

```
In [62]: np.diff(array1) - lens[:-1]+1
Out[62]: array([ 52, 131])
```

Next up, to get those `shifting-places`

where such shifts occur, we can simply do cumulative summation on the group lengths -

```
In [65]: lens[:-1].cumsum()
Out[65]: array([4, 9])
```

For completeness, we need to pre-append `0`

with the array of `shifting-places`

and `array1[0]`

for `shifting-values`

.

So, we are set to present our approach in a step-by-step format!

1] Get lengths of each group :

```
lens = array2 - array1
```

2] Get indices at which shifts occur and values to be put in `1`

's initialized array :

```
shift_idx = np.hstack((0,lens[:-1].cumsum()))
shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
```

3] Setup `1`

's initialized ID array for inserting those values at those indices listed in the step before :

```
id_arr = np.ones(lens.sum(),dtype=array1.dtype)
id_arr[shift_idx] = shift_vals
```

4] Finally do cumulative summation on the ID array :

```
output = id_arr.cumsum()
```

Listed in a function format, we would have -

```
def using_ones_cumsum(array1, array2):
lens = array2 - array1
shift_idx = np.hstack((0,lens[:-1].cumsum()))
shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
id_arr = np.ones(lens.sum(),dtype=array1.dtype)
id_arr[shift_idx] = shift_vals
return id_arr.cumsum()
```

And it works on overlapping ranges too!

```
In [67]: array1 = np.array([10, 11, 200])
...: array2 = np.array([14, 18, 204])
...:
In [68]: using_ones_cumsum(array1, array2)
Out[68]:
array([ 10, 11, 12, 13, 11, 12, 13, 14, 15, 16, 17, 200, 201,
202, 203])
```

**Runtime test**

Let's time the proposed approach against the other vectorized approach in `@unutbu's flatnonzero based solution`

, which already proved to be much better than the loopy approach -

```
In [38]: array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True)
...: .cumsum().reshape(2, -1, order='F'))
In [39]: %timeit using_flatnonzero(array1, array2)
1000 loops, best of 3: 889 µs per loop
In [40]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 235 µs per loop
```

Now, codewise NumPy doesn't like appending. So, those `np.hstack`

calls could be avoided for a slightly improved version as listed below -

```
def get_ranges_arr(starts,ends):
counts = ends - starts
counts_csum = counts.cumsum()
id_arr = np.ones(counts_csum[-1],dtype=int)
id_arr[0] = starts[0]
id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
return id_arr.cumsum()
```

Let's time it against our original approach -

```
In [151]: array1,array2 = (np.random.choice(range(1, 11),size=10**4, replace=True)\
...: .cumsum().reshape(2, -1, order='F'))
In [152]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 276 µs per loop
In [153]: %timeit get_ranges_arr(array1, array2)
10000 loops, best of 3: 193 µs per loop
```

So, we have a ** 30%** performance boost there!