Matti Lyra - 2 months ago 15

Python Question

Do you know if there is a way to get python's

`random.sample`

`random.sample()`

`TypeError: object of type 'generator' has no len()`

I was thinking that maybe there is some way of doing this with something from

`itertools`

A somewhat made up example:

`import random`

def list_item(ls):

for item in ls:

yield item

random.sample( list_item(range(100)), 20 )

As per

`MartinPieters`

`Sampling 1000 from 10000`

Using iterSample 0.0163 s

Using sample_from_iterable 0.0098 s

Using iter_sample_fast 0.0148 s

Sampling 10000 from 100000

Using iterSample 0.1786 s

Using sample_from_iterable 0.1320 s

Using iter_sample_fast 0.1576 s

Sampling 100000 from 1000000

Using iterSample 3.2740 s

Using sample_from_iterable 1.9860 s

Using iter_sample_fast 1.4586 s

Sampling 200000 from 1000000

Using iterSample 7.6115 s

Using sample_from_iterable 3.0663 s

Using iter_sample_fast 1.4101 s

Sampling 500000 from 1000000

Using iterSample 39.2595 s

Using sample_from_iterable 4.9994 s

Using iter_sample_fast 1.2178 s

Sampling 2000000 from 5000000

Using iterSample 798.8016 s

Using sample_from_iterable 28.6618 s

Using iter_sample_fast 6.6482 s

So it turns out that the

`array.insert`

`from heapq import nlargest`

import random

import timeit

def iterSample(iterable, samplesize):

results = []

for i, v in enumerate(iterable):

r = random.randint(0, i)

if r < samplesize:

if i < samplesize:

results.insert(r, v) # add first samplesize items in random order

else:

results[r] = v # at a decreasing rate, replace random items

if len(results) < samplesize:

raise ValueError("Sample larger than population.")

return results

def sample_from_iterable(iterable, samplesize):

return (x for _, x in nlargest(samplesize, ((random.random(), x) for x in iterable)))

def iter_sample_fast(iterable, samplesize):

results = []

iterator = iter(iterable)

# Fill in the first samplesize elements:

for _ in xrange(samplesize):

results.append(iterator.next())

random.shuffle(results) # Randomize their positions

for i, v in enumerate(iterator, samplesize):

r = random.randint(0, i)

if r < samplesize:

results[r] = v # at a decreasing rate, replace random items

if len(results) < samplesize:

raise ValueError("Sample larger than population.")

return results

if __name__ == '__main__':

pop_sizes = [int(10e+3),int(10e+4),int(10e+5),int(10e+5),int(10e+5),int(10e+5)*5]

k_sizes = [int(10e+2),int(10e+3),int(10e+4),int(10e+4)*2,int(10e+4)*5,int(10e+5)*2]

for pop_size, k_size in zip(pop_sizes, k_sizes):

pop = xrange(pop_size)

k = k_size

t1 = timeit.Timer(stmt='iterSample(pop, %i)'%(k_size), setup='from __main__ import iterSample,pop')

t2 = timeit.Timer(stmt='sample_from_iterable(pop, %i)'%(k_size), setup='from __main__ import sample_from_iterable,pop')

t3 = timeit.Timer(stmt='iter_sample_fast(pop, %i)'%(k_size), setup='from __main__ import iter_sample_fast,pop')

print 'Sampling', k, 'from', pop_size

print 'Using iterSample', '%1.4f s'%(t1.timeit(number=100) / 100.0)

print 'Using sample_from_iterable', '%1.4f s'%(t2.timeit(number=100) / 100.0)

print 'Using iter_sample_fast', '%1.4f s'%(t3.timeit(number=100) / 100.0)

print ''

I also ran a test to check that all the methods indeed do take an unbiased sample of the generator. So for all methods I sampled

`1000`

`10000`

`100000`

`~.1`

Answer

While the answer of Martijn Pieters is correct, it does slow down when `samplesize`

becomes large, because using `list.insert`

in a loop may have quadratic complexity.

Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:

```
def iter_sample_fast(iterable, samplesize):
results = []
iterator = iter(iterable)
# Fill in the first samplesize elements:
try:
for _ in xrange(samplesize):
results.append(iterator.next())
except StopIteration:
raise ValueError("Sample larger than population.")
random.shuffle(results) # Randomize their positions
for i, v in enumerate(iterator, samplesize):
r = random.randint(0, i)
if r < samplesize:
results[r] = v # at a decreasing rate, replace random items
return results
```

The difference slowly starts to show for `samplesize`

values above `10000`

. Times for calling with `(1000000, 100000)`

:

- iterSample: 5.05s
- iter_sample_fast: 2.64s