Eran Moshe - 6 days ago 7
Python Question

# Python - Sample from 'x' lists with different distribution

In the following code, I've created a list of items and users. I've separated the items to 3 different lists of very popular, popular and regular items.

``````import numpy as np

N_USERS = 20000
N_ITEMS = 1000

items = range(0, N_ITEMS)
users = range(0, N_USERS)

vpop = int(len(items)*0.1)
pop = int(len(items)*0.3)

np.random.shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items [pop:]
``````

I want to sample
`X`
samples from those lists with different distribution. For example:

``````list_of_items = sample(vpop_items, pop_items, reg_items, p = [0.5, 0.35, 0.15], X)
``````

where
`X`
is the number of samples I want to make, and
`P`
is the list of distributions corresponding with the lists (
`vpop_items`
,
`pop_items`
,
`reg_items`
).

So in the end I will have
`X`
"items" in
`list_of_items`
.

Let's say
`X = 100`
. I want 100 samples in total, with 0.5 chance from
`vpop_items`
, 0.35 chance from
`pop_items`
, and 0.15 chance from
`reg_items`
. The sampling must be done without replacement, that is, no item can be selected more than once.

Here's a plain Python algorithm that does what you want. It's more efficient than what you're currently doing, but I'm sure there's a smarter way to do it. :)

Let `num` be the total number of samples wanted. We first generate `num` random numbers in the range 0 - 1, and test them against the desired cumulative probabilities, keeping count of how many numbers occur in each probability range. Next, we sample each sequence, using the counts we found in the first step as the sample size. Finally, we shuffle those samples together.

In the code below I've commented out the lines that do the shuffling to make it easier to see what's going on while testing the code.

``````from random import seed, random, sample, shuffle
from itertools import accumulate

def multi_sample(seqs, probs, num):
''' Sample from each sequence in list/tuple `seqs` with the corresponding
probability in list/tuple `probs`. Return a list containing `num` samples
'''
# Compute the cumulative probability
# This really should raise ValueError if aprobs[-1] != 1.0
# and we ought to check that len(seqs) == len(probs)...
aprobs = list(accumulate(probs))

# Determine how many samples to take from each seq
counts = [0] * len(seqs)
for _ in range(num):
x = random()
for i, p in enumerate(aprobs):
if x < p:
break
counts[i] += 1

lst = []
for seq, count in zip(seqs, counts):
lst.extend(sample(seq, count))

#shuffle(lst)
return lst

# Test

N_ITEMS = 1000
items = list(range(N_ITEMS))
vpop = int(N_ITEMS * 0.1)
pop = int(N_ITEMS * 0.3)

#shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items[pop:]

all_items = (vpop_items, pop_items, reg_items)

list_of_items = multi_sample(all_items, probs=[0.5, 0.35, 0.15], num=100)
print(list_of_items)

# Verify

#list_of_items.sort()
#print(list_of_items)

# Should be ~50
print(sum(1 for x in list_of_items if x < vpop))
# Should be ~35
print(sum(1 for x in list_of_items if vpop <= x < pop))
``````

typical output

``````[65, 16, 81, 97, 30, 33, 52, 92, 96, 72, 50, 4, 75, 7, 44, 18, 90, 9, 91, 56, 85, 28, 84, 88, 76, 21, 14, 77, 8, 59, 22, 34, 93, 95, 63, 10, 99, 41, 60, 36, 66, 2, 13, 64, 51, 43, 11, 106, 153, 235, 189, 132, 150, 226, 196, 247, 245, 194, 172, 227, 202, 256, 163, 205, 131, 192, 295, 147, 246, 108, 291, 155, 128, 171, 141, 124, 102, 210, 294, 284, 276, 148, 122, 290, 948, 566, 894, 884, 310, 476, 562, 313, 357, 846, 794, 317, 335, 599, 370, 988]
47
37
``````

Be aware that this function can fail: if you call `sample(seq, count)` where `count > len(seq)` it will raise `ValueError: Sample larger than population`. So you need to ensure that `num` is small enough so that that can't occur. To be totally safe, make sure that `num` is <= than the length of the smallest sequence. With the given data, `num` is 100 and the smallest sequence is `vpop_items`, which contains 100 items, so we don't need to worry.

Thanks to Andras Deak for bringing this important point to my attention.

As I said earlier, there's bound to be a smarter way to do this: rather than calculating `counts` in a loop we should be able to just generate those counts directly using appropriate mathematics, but I'm afraid I don't know (or don't remember) how to do that. Of course, we could "cheat". :) Using the given data, we want approximately 50 items from `vpop_items`, 35 items from `pop_items` and the remaining 15 items from `reg_items`. So we could set `counts` to `[50, 35, 15]` and then make a small random adjustment to each count, taking care to keep the total equal to 100.