Vladimir Vargas - 2 months ago 15

Python Question

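I have some data in a numpy array of the form `[[sample1], [sample2], ... , [sampleN]]`, with the corresponding labels `[l1, l2, ..., lN]`, where `li` is the label of the i-th sample. I want to build a predictive model using `M < N` of these samples.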

I am not very familiar with standard practices for building and testing such predictive models. However, I have heard of dividing the dataset into two parts: one containing 9/10 of the data, which acts as training data, and the other containing the remaining 1/10, which acts as testing data. My question is: is this correct? Are there some "best practices" for this? My other question is: how can I randomly select these two sets of data from my dataset array?

Thanks a lot for your help.

Answer

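Indeed, you need to split your data. Ideally you split it into three parts (training, **validation** and test), so that the choices you make while tuning the model are not checked against the same data you use for the final test.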

Regarding the random split, you need to make sure that labels and data stay aligned, or you will learn nothing but randomness from your data. For instance (Pythonic *pseudo* code, since you did not provide any code):

```
from random import sample

# data_set and labels are the NumPy arrays from the question (N samples each);
# M is the number of training samples, with M < N.
indices = sample(range(N), M)  # M distinct random indices in 0..N-1
remaining_indices = sorted(set(range(N)) - set(indices))  # indices not picked for training
# Use the same indices for data and labels so they stay aligned
train_set = data_set[indices]
train_labels = labels[indices]
test_set = data_set[remaining_indices]
test_labels = labels[remaining_indices]
```

You can repeat the process to split the held-out data into test and validation sets. Also look into **cross-validation**.
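A three-way split can also be done in one pass by shuffling the indices once and slicing them. The sketch below is only an illustration: the toy `data_set` and `labels` arrays, the fixed seed and the 80/10/10 proportions are all assumptions, not something taken from the question.

```python
import numpy as np

# Toy stand-ins for the arrays in the question (assumed shapes).
N = 100
data_set = np.arange(N * 2).reshape(N, 2)  # N samples, 2 features each
labels = np.arange(N)                      # label i belongs to sample i

rng = np.random.default_rng(0)             # fixed seed, only for reproducibility
perm = rng.permutation(N)                  # random ordering of 0..N-1

# Assumed proportions: 80% train, 10% validation, 10% test.
n_train = (8 * N) // 10
n_val = N // 10
train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

# Index data and labels with the same index arrays so they stay aligned.
train_set, train_labels = data_set[train_idx], labels[train_idx]
val_set, val_labels = data_set[val_idx], labels[val_idx]
test_set, test_labels = data_set[test_idx], labels[test_idx]
```

Because the three index arrays are disjoint slices of a single permutation, no sample can end up in more than one set.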

As mentioned by @Sascha, all of this is also built into **Scikit-learn**, a very useful machine learning package for Python.
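As a rough sketch of what that could look like, `train_test_split` and `KFold` from `sklearn.model_selection` cover the random split and cross-validation. The toy arrays, the 1/10 test size, the 5 folds and the seeds below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy stand-ins for the arrays in the question (assumed shapes).
data_set = np.arange(40).reshape(20, 2)  # 20 samples, 2 features each
labels = np.arange(20)                   # label i belongs to sample i

# Hold out 1/10 of the samples; data and labels are shuffled together.
train_set, test_set, train_labels, test_labels = train_test_split(
    data_set, labels, test_size=0.1, random_state=0
)

# 5-fold cross-validation over the training part: each training sample
# is used for validation exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for fit_idx, val_idx in kfold.split(train_set):
    fit_set, fit_labels = train_set[fit_idx], train_labels[fit_idx]
    val_set, val_labels = train_set[val_idx], train_labels[val_idx]
    fold_sizes.append(len(val_idx))
```

Inside the loop you would fit a model on `fit_set` and score it on `val_set`; the test set stays untouched until the very end.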