I have some data in a numpy array of the form

```
[[sample1], [sample2], ..., [sampleN]]
```

with corresponding labels

```
[l1, l2, ..., lN]
```

and I want to randomly pick M of the N samples for training, with M < N.
Indeed you need to split your data. In fact you should split it into three parts — train, validation, and test — for validation purposes.
Regarding the random split, you need to make sure that labels and data stay aligned, or you will learn nothing but randomness from your data. For instance (a sketch in plain Python, since you did not provide any code):
```python
from random import sample

# N = total number of samples, M = size of the training set,
# data_set and labels are your numpy arrays (as in the question)
indices = sample(range(N), M)  # M non-repeating indices between 0 and N-1
remaining_indices = list(set(range(N)) - set(indices))  # indices left behind

# the same index list is applied to both arrays, so they stay aligned
train_set = data_set[indices]
train_labels = labels[indices]
test_set = data_set[remaining_indices]
test_labels = labels[remaining_indices]
```
You can repeat the process to split the held-out data into test and validation sets. Also look into cross-validation.
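To make the "repeat the process" step concrete, here is a minimal sketch of a second split applied to the held-out portion. The arrays and the 50/50 ratio are made up for illustration; the same index trick keeps data and labels aligned:

```python
import numpy as np
from random import sample

# hypothetical held-out data from the first split
test_set = np.arange(20).reshape(10, 2)   # 10 samples, 2 features each
test_labels = np.arange(10)               # label i belongs to row i

n_test = len(test_set)
n_val = n_test // 2  # e.g. keep half of the held-out data for validation

val_idx = sample(range(n_test), n_val)
rest_idx = list(set(range(n_test)) - set(val_idx))

# apply the same indices to data and labels so alignment is preserved
val_set, val_labels = test_set[val_idx], test_labels[val_idx]
final_test_set, final_test_labels = test_set[rest_idx], test_labels[rest_idx]
```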
As mentioned by @Sascha, all of this is also built into scikit-learn, a very useful machine learning Python package.
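For instance, scikit-learn's `train_test_split` does the index bookkeeping for you, shuffling both arrays consistently. A quick sketch on dummy data (the 80/20 split and `random_state` value are arbitrary choices here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

data_set = np.arange(20).reshape(10, 2)   # dummy data: 10 samples, 2 features
labels = np.arange(10)                    # dummy labels, one per sample

# shuffles and keeps data/labels aligned; test_size=0.2 holds out 20%
train_set, test_set, train_labels, test_labels = train_test_split(
    data_set, labels, test_size=0.2, random_state=0
)
```

Calling it a second time on `test_set`/`test_labels` gives you the test/validation split as well, and `sklearn.model_selection` also provides cross-validation helpers such as `KFold`.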