TW1411 - 6 months ago
Python Question

Pandas create random samples without duplicates

I have a pandas DataFrame containing ~200,000 rows, and I would like to create 5 random samples of 1000 rows each. However, I do not want any row to appear more than once, either within a sample or across samples.

To create a random sample I have been using:

import numpy as np
rows = np.random.choice(df.index.values, 1000)
sampled_df = df.ix[rows]


However, just doing this several times runs the risk of producing duplicates. Would the best way to handle this be to keep track of which rows have already been sampled each time?

Answer

You can use df.sample.

A dataframe with 100 rows and 5 columns:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 5), columns=list("abcde"))

Sample 5 rows:

df.sample(5)
Out[8]: 
           a         b         c         d         e
84  0.012201 -0.053014 -0.952495  0.680935  0.006724
45 -1.347292  1.358781 -0.838931 -0.280550 -0.037584
10 -0.487169  0.999899  0.524546 -1.289632 -0.370625
64  1.542704 -0.971672 -1.150900  0.554445 -1.328722
99  0.012143 -2.450915 -0.718519 -1.192069 -1.268863

Because df.sample draws without replacement by default, those 5 rows are guaranteed to be different rows of df. To repeat the process without any overlap between samples, I'd suggest sampling number_of_rows * number_of_samples rows in a single call and then splitting the result. For example, if each sample should contain 5 rows and you need 10 samples, sample 50 rows: the first 5 form the first sample, the next 5 form the second, and so on.

all_samples = df.sample(50)
samples = [all_samples.iloc[5*i:5*i+5] for i in range(10)]
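
Applied to the numbers in the question (5 samples of 1000 rows from ~200,000), the same idea could be wrapped up roughly as follows. This is only a sketch: the helper name disjoint_samples, its seed parameter, and the randomly generated stand-in DataFrame are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def disjoint_samples(df, n_samples, sample_size, seed=None):
    """Return n_samples DataFrames of sample_size rows each, sharing no row.

    Hypothetical helper for illustration only.
    """
    total = n_samples * sample_size
    if total > len(df):
        raise ValueError("not enough rows to draw disjoint samples")
    # One draw without replacement (the df.sample default), so no row repeats.
    pool = df.sample(n=total, random_state=seed)
    # Cut the pool into consecutive, non-overlapping slices.
    return [pool.iloc[i * sample_size:(i + 1) * sample_size]
            for i in range(n_samples)]

# Stand-in for the ~200,000-row frame from the question (assumed data).
df = pd.DataFrame(np.random.randn(200_000, 5), columns=list("abcde"))
samples = disjoint_samples(df, n_samples=5, sample_size=1000, seed=0)
```

Note that "no duplicates" here means no row of df is picked twice. If df itself contains rows with identical values, those can still show up in different samples unless you call df.drop_duplicates() first.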