Diogo Santos - 4 months ago 47

Python Question

I need to generate a large array with repeated elements, and my code is:

`np.repeat(xrange(x,y), data)`

However, data is a numpy array with type float64 (but it represent integeres, no 2.1 there) and I get the error

`TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'`

Exemple:

`In [35]: x`

Out[35]: 26

In [36]: y

Out[36]: 50

In [37]: data

Out[37]:

array([ 3269., 106., 5533., 317., 1512., 208., 502., 919.,

406., 421., 1690., 2236., 705., 505., 230., 213.,

307., 1628., 4389., 1491., 355., 103., 854., 424.])

In [38]: np.repeat(xrange(x,y), data)

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

<ipython-input-38-105860821359> in <module>()

----> 1 np.repeat(xrange(x,y), data)

/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy /core/fromnumeric.pyc in repeat(a, repeats, axis)

394 repeat = a.repeat

395 except AttributeError:

--> 396 return _wrapit(a, 'repeat', repeats, axis)

397 return repeat(repeats, axis)

398

/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy /core/fromnumeric.pyc in _wrapit(obj, method, *args, **kwds)

46 except AttributeError:

47 wrap = None

---> 48 result = getattr(asarray(obj), method)(*args, **kwds)

49 if wrap:

50 if not isinstance(result, mu.ndarray):

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

I solve it by changing the code to

`np.repeat(xrange(x,y), data.astype('int64'))`

However, this is now one of the most expensive lines in my code!! Is there another alternative?

By the way, I using this inside

`np.random.choice(np.repeat(xrange(x,y), data.astype('int64')), z)`

in order to get a sample without replacement with size z of the integers between x and y, with the number of each given in data. I guess this is the best approach for that also right?

Answer

Pretty interesting problem this one! Just to give the readers an idea about the problem without going into the minor data conversion issues, we have a range of values, let's say `a = np.arange(5)`

, i.e.

```
a = np.array([0,1,2,3,4])
```

Now, let's say we have another array with the number of repetitions listed for each of the `5`

numbers in `a`

. So, let those be :

```
reps = np.array([2,4,6,2,2])
```

Next up, we are performing those repetitions :

```
In [32]: rep_nums = np.repeat(a,reps)
In [33]: rep_nums
Out[33]: array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4])
```

Finally, we are looking to choose `z`

number of elements out of those repeated numbers using `np.random.choice()`

and without replacement.

Let's say `z = 7`

to choose `7`

elements, so with `np.random.choice()`

, we would have :

```
In [34]: np.random.choice(rep_nums,7,replace=False)
Out[34]: array([2, 4, 0, 2, 4, 1, 2])
```

Now, this `without replacement`

term here might sound confusing, as we already have repeated numbers in `rep_nums`

. But, what it essentially means is that, the output from `np.random.choice()`

mustn't contain e.g. more than two `4's`

, because `rep_nums`

has two `4's`

.

So, the problem is we want to get rid of that `np.repeat`

part, which might be a bottleneck for really huge arrays.

Looking at the output from `rep_nums`

, one idea would be to generate `z = 7`

unique elements ranging across the length of `rep_nums`

:

```
In [44]: np.random.choice(rep_nums.size,7,replace=False)
Out[44]: array([ 7, 2, 4, 10, 13, 8, 3])
```

These numbers represent indices for that length. So, we just need to look for the bin (out of the `5`

bins) in `rep_nums`

in which each of those `7`

numbers would go in. For that, we can use `np.searchsorted`

. Thus, we would have an implementation to handle generic `x`

, `y`

, like so -

```
# Get the intervals of those bins
intervals = data.astype(int).cumsum()
# Decide length of array if we had repeated with `np.repeat`
max_num = intervals.max()
# Get unique numbers (indices in this case)
ids = np.random.choice(max_num,z,replace=False)
# Use searchsorted to get bin IDs and add in `x` offset
out = x+np.searchsorted(intervals,ids)
```

**Runtime test**

Functions :

```
def org_app(x,y,z,data):
rep_nums = np.repeat(range(x,y), data.astype('int64'))
out = np.random.choice(rep_nums, z,replace=False)
return out
def optimized_v1(x,y,z,data):
intervals = data.astype(int).cumsum()
max_num = intervals.max()
ids = np.random.choice(max_num,z,replace=False)
out = x+np.searchsorted(intervals,ids)
return out
```

Timings on full functions -

```
In [45]: # Setup inputs
...: x = 100
...: y = 10010
...: z = 1000
...: data = np.random.randint(100,5000,(y-x)).astype(float)
...:
In [46]: %timeit org_app(x,y,z,data)
1 loop, best of 3: 7.17 s per loop
In [47]: %timeit optimized_v1(x,y,z,data)
1 loop, best of 3: 6.91 s per loop
```

Doesn't look like we are getting good speedup. Let's dig deeper and find out how much are we saving on replacing `np.repeat`

!

First off the original approach -

```
In [49]: %timeit np.repeat(range(x,y), data.astype('int64'))
1 loop, best of 3: 229 ms per loop
```

Let's see how much improvement we got on this with the proposed approach. So, let's time everything except `np.random.choice()`

in the proposed approach -

```
In [50]: intervals = data.astype(int).cumsum()
...: max_num = intervals.max()
...: ids = np.random.choice(max_num,z,replace=False)
...: out = x+np.searchsorted(intervals,ids)
...:
In [51]: %timeit data.astype(int).cumsum()
10000 loops, best of 3: 36.7 µs per loop
In [52]: %timeit intervals.max()
10000 loops, best of 3: 19.1 µs per loop
In [53]: %timeit x+np.searchsorted(intervals,ids)
10000 loops, best of 3: 127 µs per loop
```

**This is much better than 229ms from np.repeat!!**

For the sake of completeness, let's time the `np.random.choice`

from these two aproaches as we expect them to be closer on runtimes.

First off the original approach -

```
In [54]: rep_nums = np.repeat(range(x,y), data.astype('int64'))
...: out = np.random.choice(rep_nums, z,replace=False)
...:
In [55]: %timeit np.random.choice(rep_nums, z,replace=False)
1 loop, best of 3: 6.95 s per loop
```

Next up, the optimized version -

```
In [56]: intervals = data.astype(int).cumsum()
...: max_num = intervals.max()
...: ids = np.random.choice(max_num,z,replace=False)
...: out = x+np.searchsorted(intervals,ids)
...:
In [57]: %timeit np.random.choice(max_num,z,replace=False)
1 loop, best of 3: 6.89 s per loop
```

So, we are hoping that at really huge arrays, the benefit from removing `np.repeat`

would really shine, as otherwise `np.random.choice()`

itself looks like the bottleneck.