In Numpy, is there a pythonic way to create array3 with custom ranges from array1 and array2 without a loop? The straightforward solution of iterating over the ranges works but since my arrays run into millions of items, I am looking for a more efficient solution (maybe syntactic sugar too).
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
array3 = np.concatenate([np.arange(array1[i], array2[i]) for i in
I will go backwards on how to approach this problem.
Take the sample listed in the question. We have -
array1 = np.array([10, 65, 200]) array2 = np.array([14, 70, 204])
Now, look at the desired result -
Let's calculate the group lengths, as we would be needing those to explain the solution approach next.
In : lens = array2 - array1 In : lens Out: array([4, 5, 4])
The idea is to use
1's initialized array, which when cumumlative summed across the entire length would give us the desired result.
This cumumlative summation would be the last step to our solution.
1's initialized? Well, because we have an array that increasing in steps of
1's except at specific places where we have shifts
corresponding to new groups coming in.
cumsum would be the last step, so the step before it should give us something like -
array([ 10, 1, 1, 1, 52, 1, 1, 1, 1, 131, 1, 1, 1])
As discussed before, it's
1's filled with
[10,52,131] at specific places. That
10 seems to be coming in from the first element in
array1, but what about the rest?
The second one
52 came in as
65-13 (looking at the
result) and in it
13 came in the group that started with
10 and ran because of the length of
the first group
4. So, if we do
65 - 10 - 4, we will get
51 and then add
1 to accomodate for boundary stop, we would have
52, which is the
desired shifting value. Similarly, we would get
shifting-values could be computed, like so -
In : np.diff(array1) - lens[:-1]+1 Out: array([ 52, 131])
Next up, to get those
shifting-places where such shifts occur, we can simply do cumulative summation on the group lengths -
In : lens[:-1].cumsum() Out: array([4, 9])
For completeness, we need to pre-append
0 with the array of
So, we are set to present our approach in a step-by-step format!
1] Get lengths of each group :
lens = array2 - array1
2] Get indices at which shifts occur and values to be put in
1's initialized array :
shift_idx = np.hstack((0,lens[:-1].cumsum())) shift_vals = np.hstack((array1,np.diff(array1) - lens[:-1]+1))
1's initialized ID array for inserting those values at those indices listed in the step before :
id_arr = np.ones(lens.sum(),dtype=array1.dtype) id_arr[shift_idx] = shift_vals
4] Finally do cumulative summation on the ID array :
output = id_arr.cumsum()
Listed in a function format, we would have -
def using_ones_cumsum(array1, array2): lens = array2 - array1 shift_idx = np.hstack((0,lens[:-1].cumsum())) shift_vals = np.hstack((array1,np.diff(array1) - lens[:-1]+1)) id_arr = np.ones(lens.sum(),dtype=array1.dtype) id_arr[shift_idx] = shift_vals return id_arr.cumsum()
And it works on overlapping ranges too!
In : array1 = np.array([10, 11, 200]) ...: array2 = np.array([14, 18, 204]) ...: In : using_ones_cumsum(array1, array2) Out: array([ 10, 11, 12, 13, 11, 12, 13, 14, 15, 16, 17, 200, 201, 202, 203])
Let's time the proposed approach against the other vectorized approach in
@unutbu's flatnonzero based solution, which already proved to be much better than the loopy approach -
In : array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True) ...: .cumsum().reshape(2, -1, order='F')) In : %timeit using_flatnonzero(array1, array2) 1000 loops, best of 3: 889 µs per loop In : %timeit using_ones_cumsum(array1, array2) 1000 loops, best of 3: 235 µs per loop
Now, codewise NumPy doesn't like appending. So, those
np.hstack calls could be avoided for a slightly improved version as listed below -
def get_ranges_arr(starts,ends): counts = ends - starts counts_csum = counts.cumsum() id_arr = np.ones(counts_csum[-1],dtype=int) id_arr = starts id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1 return id_arr.cumsum()
Let's time it against our original approach -
In : array1,array2 = (np.random.choice(range(1, 11),size=10**4, replace=True)\ ...: .cumsum().reshape(2, -1, order='F')) In : %timeit using_ones_cumsum(array1, array2) 1000 loops, best of 3: 276 µs per loop In : %timeit get_ranges_arr(array1, array2) 10000 loops, best of 3: 193 µs per loop
So, we have a
30% performance boost there!