Eran Moshe Eran Moshe - 6 days ago 7
Python Question

Python - Sample from 'x' lists with different distribution

In the following code, I've created a list of items and users. I've separated the items to 3 different lists of very popular, popular and regular items.

import numpy as np


N_USERS = 20000
N_ITEMS = 1000

items = range(0, N_ITEMS)
users = range(0, N_USERS)

vpop = int(len(items)*0.1)
pop = int(len(items)*0.3)

np.random.shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items [pop:]


I want to sample
X
samples from those lists with different distribution. For example:

list_of_items = sample(vpop_items, pop_items, reg_items, p = [0.5, 0.35, 0.15], X)


where
X
is the number of samples I want to make, and
P
is the list of distributions corresponding with the lists (
vpop_items
,
pop_items
,
reg_items
).

So in the end I will have
X
"items" in
list_of_items
.

Let's say
X = 100
. I want 100 samples in total, with 0.5 chance from
vpop_items
, 0.35 chance from
pop_items
, and 0.15 chance from
reg_items
. The sampling must be done without replacement, that is, no item can be selected more than once.

Answer

Here's a plain Python algorithm that does what you want. It's more efficient than what you're currently doing, but I'm sure there's a smarter way to do it. :)

Let num be the total number of samples wanted. We first generate num random numbers in the range 0 - 1, and test them against the desired cumulative probabilities, keeping count of how many numbers occur in each probability range. Next, we sample each sequence, using the counts we found in the first step as the sample size. Finally, we shuffle those samples together.

In the code below I've commented out the lines that do the shuffling to make it easier to see what's going on while testing the code.

from random import seed, random, sample, shuffle
from itertools import accumulate

def multi_sample(seqs, probs, num):
    ''' Sample from each sequence in list/tuple `seqs` with the corresponding 
        probability in list/tuple `probs`. Return a list containing `num` samples
    '''
    # Compute the cumulative probability
    # This really should raise ValueError if aprobs[-1] != 1.0
    # and we ought to check that len(seqs) == len(probs)...
    aprobs = list(accumulate(probs))

    # Determine how many samples to take from each seq
    counts = [0] * len(seqs)
    for _ in range(num):
        x = random()
        for i, p in enumerate(aprobs):
            if x < p:
                break
        counts[i] += 1

    lst = []
    for seq, count in zip(seqs, counts):
        lst.extend(sample(seq, count))

    #shuffle(lst)
    return lst

# Test

N_ITEMS = 1000
items = list(range(N_ITEMS))
vpop = int(N_ITEMS * 0.1)
pop = int(N_ITEMS * 0.3)

#shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items[pop:]

all_items = (vpop_items, pop_items, reg_items)

list_of_items = multi_sample(all_items, probs=[0.5, 0.35, 0.15], num=100)
print(list_of_items)

# Verify

#list_of_items.sort()
#print(list_of_items)

# Should be ~50
print(sum(1 for x in list_of_items if x < vpop))
# Should be ~35
print(sum(1 for x in list_of_items if vpop <= x < pop))

typical output

[65, 16, 81, 97, 30, 33, 52, 92, 96, 72, 50, 4, 75, 7, 44, 18, 90, 9, 91, 56, 85, 28, 84, 88, 76, 21, 14, 77, 8, 59, 22, 34, 93, 95, 63, 10, 99, 41, 60, 36, 66, 2, 13, 64, 51, 43, 11, 106, 153, 235, 189, 132, 150, 226, 196, 247, 245, 194, 172, 227, 202, 256, 163, 205, 131, 192, 295, 147, 246, 108, 291, 155, 128, 171, 141, 124, 102, 210, 294, 284, 276, 148, 122, 290, 948, 566, 894, 884, 310, 476, 562, 313, 357, 846, 794, 317, 335, 599, 370, 988]
47
37

Be aware that this function can fail: if you call sample(seq, count) where count > len(seq) it will raise ValueError: Sample larger than population. So you need to ensure that num is small enough so that that can't occur. To be totally safe, make sure that num is <= than the length of the smallest sequence. With the given data, num is 100 and the smallest sequence is vpop_items, which contains 100 items, so we don't need to worry.

Thanks to Andras Deak for bringing this important point to my attention.


As I said earlier, there's bound to be a smarter way to do this: rather than calculating counts in a loop we should be able to just generate those counts directly using appropriate mathematics, but I'm afraid I don't know (or don't remember) how to do that. Of course, we could "cheat". :) Using the given data, we want approximately 50 items from vpop_items, 35 items from pop_items and the remaining 15 items from reg_items. So we could set counts to [50, 35, 15] and then make a small random adjustment to each count, taking care to keep the total equal to 100.

Comments