Vladimir Vargas - 17 days ago
Python Question

Best practices for prediction of labeled data (numpy)

I have some data in a numpy array of the form [[sample1], [sample2], ..., [sampleN]]. I have some labels of the form [l1, l2, ..., lN], where each li can take up to 4 different values (meaning that my samples are partitioned into 4 sets). I want to choose a number M < N of samples from my data array and train my predictive model with them. The remaining data is then used as test data to check the accuracy of my predictive model.

I am not very familiar with standard practices for building and testing such predictive models. However, I have heard of dividing the dataset into two parts: one containing 9/10 of the data to act as training data, and the other containing 1/10 to act as test data. My questions are: Is this correct? Are there "best practices" for this? And how can I randomly select these two sets from my dataset array?

Thanks a lot for your help.


Indeed you need to split your data. You may even need to split it into three parts (train/validation/test) if you want to tune hyperparameters without biasing your accuracy estimate.

Regarding the random split, you need to make sure that labels and data stay aligned, or you will learn nothing but randomness from your data. For instance (Pythonic pseudo-code, since you did not provide any code):

from random import sample

indices = sample(range(N), M)  # M distinct indices between 0 and N-1
remaining_indices = list(set(range(N)) - set(indices))  # the indices left behind

# numpy fancy indexing keeps data and labels aligned
train_set = data_set[indices]
train_labels = labels[indices]
test_set = data_set[remaining_indices]
test_labels = labels[remaining_indices]

You can repeat the process to split the test data into test+validation. Also look into cross validation.
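As a minimal sketch of that idea with toy data (the array shapes and the 80/10/10 split are my own assumptions, not something from your question), you can shuffle the indices once and carve out all three subsets:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
data_set = rng.normal(size=(N, 3))   # toy data: N samples, 3 features
labels = rng.integers(0, 4, size=N)  # toy labels taking 4 possible values

# Shuffle all indices once, then slice into train/test/validation.
indices = rng.permutation(N)
train_idx = indices[:80]
test_idx = indices[80:90]
val_idx = indices[90:]

train_set, train_labels = data_set[train_idx], labels[train_idx]
test_set, test_labels = data_set[test_idx], labels[test_idx]
val_set, val_labels = data_set[val_idx], labels[val_idx]
```

Because every index appears exactly once in the permutation, the three subsets are guaranteed to be disjoint and to cover the whole dataset.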

As mentioned by @Sascha, this is all built into scikit-learn, a very useful machine learning Python package.
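For example, in recent versions of scikit-learn the split lives in sklearn.model_selection.train_test_split, which keeps data and labels aligned for you (again a sketch on toy arrays; stratify=labels is an optional extra that preserves the label proportions in both halves):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
data_set = rng.normal(size=(100, 3))   # toy data: 100 samples, 3 features
labels = rng.integers(0, 4, size=100)  # toy labels with 4 possible values

# test_size=0.1 gives the 9/10 train, 1/10 test split from the question;
# stratify keeps the 4 label classes in roughly equal proportion in each part.
X_train, X_test, y_train, y_test = train_test_split(
    data_set, labels, test_size=0.1, stratify=labels, random_state=0)
```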