In the following code, I've created a list of items and users. I've separated the items to 3 different lists of very popular, popular and regular items.
import numpy as np
N_USERS = 20000
N_ITEMS = 1000
items = range(0, N_ITEMS)
users = range(0, N_USERS)
vpop = int(len(items)*0.1)
pop = int(len(items)*0.3)
np.random.shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items [pop:]
X
list_of_items = sample(vpop_items, pop_items, reg_items, p = [0.5, 0.35, 0.15], X)
X
P
vpop_items
pop_items
reg_items
X
list_of_items
X = 100
vpop_items
pop_items
reg_items
Here's a plain Python algorithm that does what you want. It's more efficient than what you're currently doing, but I'm sure there's a smarter way to do it. :)
Let num
be the total number of samples wanted. We first generate num
random numbers in the range 0 - 1, and test them against the desired cumulative probabilities, keeping count of how many numbers occur in each probability range. Next, we sample each sequence, using the counts we found in the first step as the sample size. Finally, we shuffle those samples together.
In the code below I've commented out the lines that do the shuffling to make it easier to see what's going on while testing the code.
from random import seed, random, sample, shuffle
from itertools import accumulate
def multi_sample(seqs, probs, num):
''' Sample from each sequence in list/tuple `seqs` with the corresponding
probability in list/tuple `probs`. Return a list containing `num` samples
'''
# Compute the cumulative probability
# This really should raise ValueError if aprobs[-1] != 1.0
# and we ought to check that len(seqs) == len(probs)...
aprobs = list(accumulate(probs))
# Determine how many samples to take from each seq
counts = [0] * len(seqs)
for _ in range(num):
x = random()
for i, p in enumerate(aprobs):
if x < p:
break
counts[i] += 1
lst = []
for seq, count in zip(seqs, counts):
lst.extend(sample(seq, count))
#shuffle(lst)
return lst
# Test
N_ITEMS = 1000
items = list(range(N_ITEMS))
vpop = int(N_ITEMS * 0.1)
pop = int(N_ITEMS * 0.3)
#shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items[pop:]
all_items = (vpop_items, pop_items, reg_items)
list_of_items = multi_sample(all_items, probs=[0.5, 0.35, 0.15], num=100)
print(list_of_items)
# Verify
#list_of_items.sort()
#print(list_of_items)
# Should be ~50
print(sum(1 for x in list_of_items if x < vpop))
# Should be ~35
print(sum(1 for x in list_of_items if vpop <= x < pop))
typical output
[65, 16, 81, 97, 30, 33, 52, 92, 96, 72, 50, 4, 75, 7, 44, 18, 90, 9, 91, 56, 85, 28, 84, 88, 76, 21, 14, 77, 8, 59, 22, 34, 93, 95, 63, 10, 99, 41, 60, 36, 66, 2, 13, 64, 51, 43, 11, 106, 153, 235, 189, 132, 150, 226, 196, 247, 245, 194, 172, 227, 202, 256, 163, 205, 131, 192, 295, 147, 246, 108, 291, 155, 128, 171, 141, 124, 102, 210, 294, 284, 276, 148, 122, 290, 948, 566, 894, 884, 310, 476, 562, 313, 357, 846, 794, 317, 335, 599, 370, 988]
47
37
Be aware that this function can fail: if you call sample(seq, count)
where count > len(seq)
it will raise ValueError: Sample larger than population
. So you need to ensure that num
is small enough so that that can't occur. To be totally safe, make sure that num
is <= than the length of the smallest sequence. With the given data, num
is 100 and the smallest sequence is vpop_items
, which contains 100 items, so we don't need to worry.
Thanks to Andras Deak for bringing this important point to my attention.
As I said earlier, there's bound to be a smarter way to do this: rather than calculating counts
in a loop we should be able to just generate those counts directly using appropriate mathematics, but I'm afraid I don't know (or don't remember) how to do that. Of course, we could "cheat". :) Using the given data, we want approximately 50 items from vpop_items
, 35 items from pop_items
and the remaining 15 items from reg_items
. So we could set counts
to [50, 35, 15]
and then make a small random adjustment to each count, taking care to keep the total equal to 100.