TW1411 - 8 months ago 42

Python Question

I have a pandas dataframe containing ~200,000 rows, and I would like to create 5 random samples of 1000 rows each. However, I do not want any row to appear in more than one of these samples.

To create a random sample I have been using:

```
import numpy as np

rows = np.random.choice(df.index.values, 1000)
sampled_df = df.ix[rows]
```

However, just doing this several times would run the risk of having duplicates. Would the best way to handle this be to keep track of which rows have been sampled each time?

Answer

You can use `df.sample`.

A dataframe with 100 rows and 5 columns:

```
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 5), columns=list("abcde"))
```

Sample 5 rows:

```
df.sample(5)
Out[8]:
           a         b         c         d         e
84  0.012201 -0.053014 -0.952495  0.680935  0.006724
45 -1.347292  1.358781 -0.838931 -0.280550 -0.037584
10 -0.487169  0.999899  0.524546 -1.289632 -0.370625
64  1.542704 -0.971672 -1.150900  0.554445 -1.328722
99  0.012143 -2.450915 -0.718519 -1.192069 -1.268863
```

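Because `df.sample` draws without replacement by default, those 5 rows are guaranteed to be different. To repeat the process without overlap between samples, I'd suggest sampling number_of_rows * number_of_samples rows in a single call and then slicing the result. For example, if each sample is going to contain 5 rows and you need 10 samples, sample 50 rows: the first 5 are the first sample, the next 5 are the second, and so on. The dataframe must have at least that many rows in total.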

```
all_samples = df.sample(50)
samples = [all_samples.iloc[5*i:5*i+5] for i in range(10)]
```
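
Applied to the numbers in the question (5 samples of 1000 rows from ~200,000), the same idea might look like the sketch below. The helper name `disjoint_samples` and the randomly generated stand-in dataframe are assumptions made for illustration; only `df.sample` and `iloc` come from the answer above.

```python
import numpy as np
import pandas as pd


def disjoint_samples(df, n_samples, sample_size, random_state=None):
    """Hypothetical helper: n_samples frames of sample_size rows, no row shared."""
    total = n_samples * sample_size
    if total > len(df):
        raise ValueError("not enough rows for non-overlapping samples")
    # A single draw without replacement (the default) yields distinct rows.
    pooled = df.sample(n=total, random_state=random_state)
    # Cut the pooled draw into consecutive, non-overlapping chunks.
    return [pooled.iloc[i * sample_size:(i + 1) * sample_size]
            for i in range(n_samples)]


# Stand-in for the 200,000-row dataframe from the question.
df = pd.DataFrame(np.random.randn(200_000, 5), columns=list("abcde"))
samples = disjoint_samples(df, n_samples=5, sample_size=1000, random_state=0)

# Sanity check: no index label appears in more than one sample.
all_labels = pd.concat(samples).index
print(len(samples), all_labels.is_unique)
```

Passing `random_state` is optional; it only makes the draw reproducible. The uniqueness check relies on the dataframe's index labels being unique to begin with.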