Demetri P - 21 days ago
Python Question

How can I sample equally from a dataframe?

Suppose I have some observations, each with an indicated class from 1 to n. These classes do not necessarily occur equally often in the data set.

How can I equally sample from the dataframe? Right now I do something like...

frames = []
classes = df.classes.unique()

for i in classes:
    # Draw sample_size rows from the rows belonging to class i
    g = df[df.classes == i].sample(sample_size)
    frames.append(g)

equally_sampled = pd.concat(frames)


Is there a pandas function to equally sample?

Answer

More concisely, you can do this:

df.groupby('classes').apply(lambda x: x.sample(sample_size))
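
As a runnable illustration of sampling the same number of rows per class, here is a minimal sketch. It assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available as a built-in alternative to `apply` with a lambda. The toy dataframe, its `value` column, and the `random_state` seed are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: three classes with unequal frequencies.
df = pd.DataFrame({
    "classes": ["a"] * 50 + ["b"] * 30 + ["c"] * 20,
    "value": range(100),
})
sample_size = 10

# Draw the same number of rows from every class.
# Each class needs at least sample_size rows, unless replace=True is passed.
equally_sampled = df.groupby("classes").sample(n=sample_size, random_state=0)

# Count how many sampled rows each class contributed.
counts = equally_sampled["classes"].value_counts()
print(counts)
```

If some class has fewer than `sample_size` rows, sampling without replacement raises an error, so you would either lower `sample_size` or sample with replacement.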

Extension:

You can make the sample_size a function of group size to sample with equal probabilities (or proportionately):

nrows = len(df)
total_sample_size = 1e4
# Each class gets a share of the sample proportional to its share of the rows
df.groupby('classes').\
    apply(lambda x: x.sample(int((len(x)/nrows)*total_sample_size)))

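The result won't contain exactly total_sample_size rows, because each group's sample size is truncated to an integer, but each class is sampled in proportion to its frequency rather than at a fixed size as in the naive method.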
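
The proportional variant can also be sketched with the `frac` argument of `DataFrameGroupBy.sample`, again assuming pandas 1.1 or later. The toy data and the seed are hypothetical; `frac` is simply the target sample size divided by the number of rows, so every class keeps the same fraction of its rows.

```python
import pandas as pd

# Hypothetical toy data: 1000 rows split unevenly across three classes.
df = pd.DataFrame({
    "classes": ["a"] * 500 + ["b"] * 300 + ["c"] * 200,
    "value": range(1000),
})

nrows = len(df)
total_sample_size = 100

# Fraction of each class to keep, so class proportions are preserved.
frac = total_sample_size / nrows

proportional_sample = df.groupby("classes").sample(frac=frac, random_state=0)

# Sampled rows per class, as a plain dict keyed by class label.
counts = proportional_sample["classes"].value_counts().to_dict()
print(counts)
```

Because each group's size is rounded to a whole number of rows, the total can differ slightly from `total_sample_size` when the class sizes do not divide evenly.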