Freya Ren - 4 months ago
Python Question

How to split data into trainset and testset randomly?

I have a large dataset and want to split it into a training set (50%) and a testing set (50%).

Say I have 100 examples stored in the input file, one example per line. I need to choose 50 lines as the training set and the other 50 lines as the testing set.

My idea is to first generate a random permutation of the line numbers 1 to 100, then use the first 50 elements as the line numbers of the 50 training examples and the remaining 50 elements as the line numbers of the testing examples.

This is easy to do in Matlab:

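plist = randperm(100);                      % random permutation of the line numbers 1..100
C = textscan(fid, '%s', 'delimiter', '\n'); % fid is an open file handle; the lines end up in C{1}
trainstring = C{1}(plist(1:50));            % first 50 shuffled lines
teststring = C{1}(plist(51:100));           % remaining 50 lines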

But how can I do this in Python? I'm new to Python and don't know whether I can read the whole file into an array and pick out certain lines.


This can be done similarly in Python using lists. Note that random.shuffle shuffles the whole list in place.

import random

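with open("datafile.txt", "r") as f:
    data = f.read().splitlines()  # one example per line, newlines removed

random.shuffle(data)  # shuffles the whole list in place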

train_data = data[:50]
test_data = data[50:]
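
If you would rather leave the original list untouched and pick lines by index, as described in the question, something along these lines might work. This is only a sketch: the helper name split_lines, its train_fraction and seed parameters, and the stand-in examples list are my own assumptions for illustration, not anything from the question.

```python
import random


def split_lines(lines, train_fraction=0.5, seed=None):
    """Return (train, test) lists chosen at random without modifying `lines`.

    Hypothetical helper for illustration; the name and parameters are assumptions.
    """
    rng = random.Random(seed)  # pass a seed to get a repeatable split
    # Random permutation of the indices 0..len(lines)-1 (Python counts from 0).
    order = rng.sample(range(len(lines)), len(lines))
    cut = int(len(lines) * train_fraction)  # number of training examples
    train = [lines[i] for i in order[:cut]]
    test = [lines[i] for i in order[cut:]]
    return train, test


# Stand-in data so the sketch runs without an input file.
examples = ["example %d" % i for i in range(1, 101)]
train_data, test_data = split_lines(examples, train_fraction=0.5, seed=0)
print(len(train_data), len(test_data))  # prints: 50 50
```

Computing the cut from len(lines) avoids hard-coding 50, so the same call should work when the file has a different number of lines.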