Diogo Santos - 3 months ago
Python Question

Efficient use of numpy.random.choice with repeated numbers and alternatives

I need to generate a large array with repeated elements, and my code is:

np.repeat(xrange(x,y), data)


However, data is a numpy array of type float64 (though it represents integers; there are no values like 2.1) and I get the error

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'


Example:

In [35]: x
Out[35]: 26

In [36]: y
Out[36]: 50

In [37]: data
Out[37]:
array([ 3269., 106., 5533., 317., 1512., 208., 502., 919.,
406., 421., 1690., 2236., 705., 505., 230., 213.,
307., 1628., 4389., 1491., 355., 103., 854., 424.])
In [38]: np.repeat(xrange(x,y), data)

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-105860821359> in <module>()
----> 1 np.repeat(xrange(x,y), data)

/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in repeat(a, repeats, axis)
394 repeat = a.repeat
395 except AttributeError:
--> 396 return _wrapit(a, 'repeat', repeats, axis)
397 return repeat(repeats, axis)
398

/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in _wrapit(obj, method, *args, **kwds)
46 except AttributeError:
47 wrap = None
---> 48 result = getattr(asarray(obj), method)(*args, **kwds)
49 if wrap:
50 if not isinstance(result, mu.ndarray):

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
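The error itself comes from NumPy's 'safe' casting rule (the rule named in the traceback): the repeats argument is converted to an integer type under casting='safe', which disallows float64 to int64. A quick, purely illustrative check:

import numpy as np

# 'safe' casting (the rule named in the error above) refuses float64 -> int64
np.can_cast(np.float64, np.int64, casting='safe')     # False

# astype defaults to casting='unsafe', so an explicit conversion goes through
np.can_cast(np.float64, np.int64, casting='unsafe')   # True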


I solved it by changing the code to

np.repeat(xrange(x,y), data.astype('int64'))


However, this is now one of the most expensive lines in my code! Is there a faster alternative?

By the way, I am using this inside

np.random.choice(np.repeat(xrange(x,y), data.astype('int64')), z, replace=False)


in order to get a sample without replacement of size z from the integers between x and y, with the count of each given in data. Is this the best approach for that as well?

Answer

Problem statement

Pretty interesting problem, this one! Just to give readers an idea of the problem without going into the minor data-conversion issues: we have a range of values, let's say a = np.arange(5), i.e.

a = np.array([0,1,2,3,4])

Now, let's say we have another array with the number of repetitions listed for each of the 5 numbers in a. So, let those be :

reps = np.array([2,4,6,2,2])

Next up, we are performing those repetitions :

In [32]: rep_nums = np.repeat(a,reps)

In [33]: rep_nums
Out[33]: array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4])

Finally, we are looking to choose z elements out of those repeated numbers using np.random.choice(), without replacement.

Let's say z = 7 to choose 7 elements, so with np.random.choice(), we would have :

In [34]: np.random.choice(rep_nums,7,replace=False)
Out[34]: array([2, 4, 0, 2, 4, 1, 2])

Now, the without replacement part might sound confusing here, since rep_nums already contains repeated numbers. What it essentially means is that the output from np.random.choice() must not contain, for example, more than two 4's, because rep_nums has only two 4's.
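A quick way to convince yourself of this (purely illustrative, restating the toy setup from above) is to check that the per-value counts in a sample never exceed the available repetitions -

a = np.arange(5)
reps = np.array([2, 4, 6, 2, 2])
rep_nums = np.repeat(a, reps)

sample = np.random.choice(rep_nums, 7, replace=False)

# counts of each value in the sample are bounded by the counts in rep_nums
counts = np.bincount(sample, minlength=a.size)
assert np.all(counts <= reps)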

So, the problem is we want to get rid of that np.repeat part, which might be a bottleneck for really huge arrays.

Proposed approach

Looking at the output from rep_nums, one idea would be to generate z = 7 unique elements ranging across the length of rep_nums :

In [44]: np.random.choice(rep_nums.size,7,replace=False)
Out[44]: array([ 7,  2,  4, 10, 13,  8,  3])

These numbers represent indices into the repeated array. So, we just need to find the bin (out of the 5 bins) in rep_nums into which each of those 7 indices falls. For that, we can use np.searchsorted. Thus, we would have an implementation that handles generic x, y, like so -

# Get the intervals of those bins
intervals = data.astype(int).cumsum()

# Decide length of array if we had repeated with `np.repeat`
max_num = intervals.max()

# Get unique numbers (indices in this case)
ids = np.random.choice(max_num,z,replace=False)

# Use searchsorted to get bin IDs and add in the `x` offset;
# side='right' makes an index that equals a bin edge fall into the next bin,
# matching what `np.repeat` would have produced
out = x + np.searchsorted(intervals, ids, side='right')
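On the toy example from before (a = np.arange(5), reps = np.array([2,4,6,2,2])), the same idea plays out like so, reusing the indices drawn in In [44] -

intervals = reps.cumsum()                 # array([ 2,  6, 12, 14, 16])

ids = np.array([7, 2, 4, 10, 13, 8, 3])   # the indices drawn in In [44]
out = np.searchsorted(intervals, ids, side='right')

# out is array([2, 1, 1, 2, 3, 2, 1]), i.e. rep_nums[7] == 2, rep_nums[2] == 1, etc.
# No `x` offset is needed here since the values start at 0.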

Runtime test

Functions :

def org_app(x,y,z,data):
    rep_nums = np.repeat(range(x,y), data.astype('int64'))
    out = np.random.choice(rep_nums, z,replace=False)
    return out

def optimized_v1(x,y,z,data):     
    intervals = data.astype(int).cumsum()
    max_num = intervals.max()
    ids = np.random.choice(max_num,z,replace=False)
    out = x + np.searchsorted(intervals, ids, side='right')
    return out
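Before the timings, a quick sanity check (purely illustrative; z here is picked arbitrarily) that both functions stay within [x, y) and never draw a value more often than data allows, using the question's x, y and data -

x, y, z = 26, 50, 1000   # x, y and data as in the question; z chosen arbitrarily
data = np.array([ 3269.,  106., 5533.,  317., 1512.,  208.,  502.,  919.,
                   406.,  421., 1690., 2236.,  705.,  505.,  230.,  213.,
                   307., 1628., 4389., 1491.,  355.,  103.,  854.,  424.])

for func in (org_app, optimized_v1):
    out = func(x, y, z, data)
    counts = np.bincount(out - x, minlength=y - x)
    assert out.min() >= x and out.max() < y       # sampled values stay in [x, y)
    assert np.all(counts <= data.astype(int))     # no value drawn more often than available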

Timings on full functions -

In [45]: # Setup inputs
    ...: x = 100
    ...: y = 10010
    ...: z = 1000
    ...: data = np.random.randint(100,5000,(y-x)).astype(float)
    ...: 

In [46]: %timeit org_app(x,y,z,data)
1 loop, best of 3: 7.17 s per loop

In [47]: %timeit optimized_v1(x,y,z,data)
1 loop, best of 3: 6.91 s per loop

Doesn't look like we are getting a good speedup. Let's dig deeper and find out how much we are saving by replacing np.repeat!

First off, the original approach -

In [49]: %timeit np.repeat(range(x,y), data.astype('int64'))
1 loop, best of 3: 229 ms per loop

Let's see how much of an improvement the proposed approach gives us there. So, let's time everything except np.random.choice() in the proposed approach -

In [50]: intervals = data.astype(int).cumsum()
    ...: max_num = intervals.max()
    ...: ids = np.random.choice(max_num,z,replace=False)
    ...: out = x+np.searchsorted(intervals,ids)
    ...: 

In [51]: %timeit data.astype(int).cumsum()
10000 loops, best of 3: 36.7 µs per loop

In [52]: %timeit intervals.max()
10000 loops, best of 3: 19.1 µs per loop

In [53]: %timeit x+np.searchsorted(intervals,ids)
10000 loops, best of 3: 127 µs per loop

This is much better than the 229 ms from np.repeat!

For the sake of completeness, let's time np.random.choice from these two approaches, as we expect them to have similar runtimes.

First off, the original approach -

In [54]: rep_nums = np.repeat(range(x,y), data.astype('int64'))
    ...: out = np.random.choice(rep_nums, z,replace=False)
    ...: 

In [55]: %timeit np.random.choice(rep_nums, z,replace=False)
1 loop, best of 3: 6.95 s per loop

Next up, the optimized version -

In [56]: intervals = data.astype(int).cumsum()
    ...: max_num = intervals.max()
    ...: ids = np.random.choice(max_num,z,replace=False)
    ...: out = x+np.searchsorted(intervals,ids)
    ...: 

In [57]: %timeit np.random.choice(max_num,z,replace=False)
1 loop, best of 3: 6.89 s per loop

So, we are hoping that with really huge arrays, the benefit of removing np.repeat will really shine, since otherwise np.random.choice() itself looks like the bottleneck.
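If np.random.choice(..., replace=False) itself turns out to be the limiting factor (it effectively has to permute all max_num indices), one possible workaround when z is much smaller than max_num is set-based rejection sampling of the indices. The helper below is a hypothetical sketch of that idea, not part of the timed code above -

import numpy as np

def sample_ids_without_replacement(max_num, z):
    # Hypothetical helper: draw batches with replacement and keep unique
    # indices until we have z of them. Efficient only when z << max_num,
    # since collisions are then rare.
    chosen = set()
    while len(chosen) < z:
        batch = np.random.randint(0, max_num, size=z - len(chosen))
        chosen.update(batch.tolist())
    # order of the result is arbitrary, which is fine for the binning step
    return np.fromiter(chosen, dtype=np.int64, count=z)

# possible drop-in for the `ids` line in optimized_v1:
# ids = sample_ids_without_replacement(max_num, z)

The standard library's random.sample takes a similar set-based route internally when the sample size is small relative to the population, so random.sample(xrange(max_num), z) would be another option worth timing.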
