erik - 11 months ago 883

Python Question

What is a good way to split a numpy array randomly into training and testing / validation dataset? Something similar to the cvpartition or crossvalind functions in Matlab.

Answer

If you want to split the data set once into two parts, you can use `numpy.random.shuffle`, or `numpy.random.permutation` if you need to keep track of the indices:

```
import numpy
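# x is your dataset
x = numpy.random.rand(100, 5)
# shuffle the rows in place (only the first axis is shuffled)
numpy.random.shuffle(x)
# first 80 rows for training, remaining 20 for testing
training, test = x[:80,:], x[80:,:]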
```

or

```
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
# random ordering of the row indices; x itself is left untouched
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
```
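Keeping the indices is mostly useful when a second array, such as labels, has to be split the same way. A minimal sketch, assuming a label array `y` and a helper named `split_train_test` (both are hypothetical and not part of the answer above):

```python
import numpy

def split_train_test(x, y, train_fraction=0.8, seed=None):
    """Split x and y row-wise using one shared random permutation.

    Hypothetical helper for illustration; the name and parameters are assumptions.
    """
    rng = numpy.random.default_rng(seed)
    indices = rng.permutation(x.shape[0])
    n_train = int(round(train_fraction * x.shape[0]))
    training_idx, test_idx = indices[:n_train], indices[n_train:]
    # index both arrays with the same index sets so rows and labels stay aligned
    return x[training_idx], x[test_idx], y[training_idx], y[test_idx]

# x is your dataset, y is a stand-in label array (assumed for this example)
x = numpy.random.rand(100, 5)
y = numpy.arange(100)
x_train, x_test, y_train, y_test = split_train_test(x, y, seed=0)
```

Passing a fixed `seed` makes the split reproducible, which helps when comparing models on the same partition.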

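There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset with replacement. Note that a row can then be drawn more than once, and the same row can end up in both the training and the test set: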

```
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]
```
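If the overlap between training and test rows is a concern, one possible variant (an assumption on my part, not something the snippet above does) is to draw only the training rows with replacement and test on the rows that were never drawn. A rough sketch, where the number of rounds `n_rounds` is an arbitrary illustrative value:

```python
import numpy

# x is your dataset
x = numpy.random.rand(100, 5)
rng = numpy.random.default_rng(0)
n_rounds = 5  # assumed number of resampling rounds

index_splits = []
for _ in range(n_rounds):
    # draw as many training rows as there are samples, with replacement
    training_idx = rng.integers(x.shape[0], size=x.shape[0])
    # rows that were never drawn form the test set for this round
    test_idx = numpy.setdiff1d(numpy.arange(x.shape[0]), training_idx)
    index_splits.append((training_idx, test_idx))

# materialize the first round as an example
training, test = x[index_splits[0][0], :], x[index_splits[0][1], :]
```

The test set size varies from round to round here, since it depends on which rows happened to be drawn.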

Finally, scikit-learn (formerly scikits.learn) contains several cross-validation methods (k-fold, leave-n-out, stratified k-fold, ...). For the docs you might need to look at the examples or the latest git repository, but the code looks solid.
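As a rough illustration of the k-fold case, here is a sketch using `KFold` from `sklearn.model_selection`. That module path is taken from current scikit-learn releases, so it assumes a reasonably recent install; the choice of 5 folds and the fixed `random_state` are arbitrary:

```python
import numpy
from sklearn.model_selection import KFold

# x is your dataset
x = numpy.random.rand(100, 5)

# 5 folds, shuffled before splitting; each row is used for testing exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for training_idx, test_idx in kfold.split(x):
    training, test = x[training_idx, :], x[test_idx, :]
    fold_sizes.append((training.shape[0], test.shape[0]))
```

With 100 rows and 5 folds, each iteration yields 80 training rows and 20 test rows, matching the 80/20 split used in the earlier snippets.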