Demetri P - 10 months ago
Python Question

How can I sample equally from a dataframe?

Suppose I have some observations, each with an indicated class from

. Each of these classes may not necessarily occur equally in the data set.

How can I equally sample from the dataframe? Right now I do something like...

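import pandas as pd

frames = []
classes = df.classes.unique()

for i in classes:
    # Draw sample_size rows from the rows belonging to class i
    g = df[df.classes == i].sample(sample_size)
    frames.append(g)

# Stack the per-class samples into one dataframe
equally_sampled = pd.concat(frames)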

Is there a pandas function to equally sample?

Answer

A more concise way is to group by class and sample within each group:

df.groupby('classes').apply(lambda x: x.sample(sample_size))
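
As a minimal runnable sketch of the same idea: the toy dataframe, its `value` column, and `sample_size = 5` are made up for illustration. It uses the built-in `groupby(...).sample(n=...)`, which is available in pandas 1.1 and later and should be equivalent to the `apply`/`lambda` form for this purpose.

```python
import pandas as pd

# Hypothetical toy data: class "a" is much more common than "b" or "c".
df = pd.DataFrame({
    "classes": ["a"] * 50 + ["b"] * 20 + ["c"] * 10,
    "value": range(80),
})
sample_size = 5  # rows to draw per class (illustrative value)

# Draw the same number of rows from every class.
# random_state is fixed only to make the example reproducible.
equally_sampled = df.groupby("classes").sample(n=sample_size, random_state=0)

# Every class should now appear exactly sample_size times.
counts = equally_sampled["classes"].value_counts().sort_index()
print(counts)
```

Note that sampling without replacement raises an error if any class has fewer than `sample_size` rows; passing `replace=True` is one way around that, at the cost of duplicated rows.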


You can also make the sample size a function of group size, so that each class is sampled in proportion to its frequency (every row then has roughly the same probability of being selected):

nrows = len(df)
total_sample_size = 1e4
# Each class contributes rows in proportion to its share of the data
df.groupby('classes').apply(lambda x: x.sample(int((len(x) / nrows) * total_sample_size)))

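Because each group's share is truncated to an integer, the result may contain slightly fewer rows than total_sample_size, but each class will be represented roughly in proportion to its frequency, which the fixed-size method does not do.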
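
A small self-contained sketch of proportional sampling, under made-up assumptions: the toy dataframe, the `value` column, the helper name `sample_proportionally`, and `total_sample_size = 100` are all hypothetical. It uses integer floor division instead of `int(...)` on a float, which truncates the same way but avoids floating-point rounding surprises.

```python
import pandas as pd

# Hypothetical toy data with a 60/30/10 class split.
df = pd.DataFrame({
    "classes": ["a"] * 600 + ["b"] * 300 + ["c"] * 100,
    "value": range(1000),
})
total_sample_size = 100  # illustrative target size
nrows = len(df)

def sample_proportionally(group):
    # Rows for this class, proportional to its share of the data
    # (floor division truncates, so the total can fall short of the target).
    n = len(group) * total_sample_size // nrows
    return group.sample(n, random_state=0)

# Sample each class separately, then stack the pieces back together.
pieces = [sample_proportionally(g) for _, g in df.groupby("classes")]
proportional = pd.concat(pieces)

counts = proportional["classes"].value_counts().sort_index()
print(counts)
```

If a fixed fraction of each class is all that is needed, `df.groupby("classes").sample(frac=...)` (pandas 1.1 and later) gives a similar result without the helper.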