mom mom - 3 months ago 15
Python Question

Python script to move files to either a train dir or test dir

I am currently writing a Python script that divides my data into a train dir and a test dir. I give the script a ratio that specifies the train/test split, and files should be assigned at random to train or test according to that ratio.

E.g. if the ratio = 0.5, half of my dataset should end up in train and the other half in test.

Another example: if the ratio = 0.25, 75% of the dataset should end up in train and the rest in test.

But the division seems to be wrong every time. I am trying to separate 84 files/dirs and can't seem to hit the golden 42/42 separation. Any suggestions on what I could do differently?

Here is the code:

import sys
import os
import shutil
import numpy
import random


src = sys.argv[1]
destination_data = sys.argv[2]

src_abs = os.path.abspath(src)
destination_data_abs = os.path.abspath(destination_data)

src_files = os.listdir(src_abs)


def copytree(src, dst, symlinks=False, ignore=None, split=0.5):
    for item in os.listdir(src):
        s = os.path.join(src, item)
        d = os.path.join(dst, item)
        d_test = os.path.join(dst, 'test', item)
        d_train = os.path.join(dst, 'train', item)

        print(d_test)
        print(d_train)
        # Draw a random number between 0 and 1 for this item
        minmax = 0.0, 1.0
        rand = random.uniform(*minmax)
        print(rand)
        if rand > split:
            # Inserted into train
            if os.path.isdir(s):
                shutil.copytree(s, d_train, symlinks, ignore)
                print("Copytree used! - TRAIN")
            else:
                shutil.copy2(s, d_train)
                print("Copy 2 used! - TRAIN")
        else:
            # Inserted into test
            if os.path.isdir(s):
                shutil.copytree(s, d_test, symlinks, ignore)
                print("Copytree used! - TEST")
            else:
                shutil.copy2(s, d_test)
                print("Copy 2 used! - TEST")


copytree(src_abs, destination_data_abs, True)


The code is being executed on a Unix machine, if that matters.

Answer

You can take the list of files, shuffle it, then split it with respect to the split ratio.

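import os
import numpy

# List everything in the source directory (here: the current directory)
src_files = os.listdir(".")
n_files = len(src_files)

# Index at which the shuffled list is cut in two
split_ratio = 0.5
split_index = int(n_files * split_ratio)

# Shuffle in place: the order is random, but the group sizes are exact
numpy.random.shuffle(src_files)

print(src_files[0:split_index])  # first group
print(src_files[split_index:])   # second group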
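
To connect this with the copying from the question, here is a minimal sketch of the shuffle-then-split idea applied to the train/test directories. The function name `split_and_copy`, the `test_ratio` and `seed` parameters, and the use of the standard-library `random` module instead of numpy are my own assumptions, not part of the original code. It also assumes the ratio is the share that goes to test, as in the question's examples, and that the items do not already exist in the destination.

```python
import os
import random
import shutil


def split_and_copy(src, dst, test_ratio=0.5, seed=None):
    """Copy every entry of src into dst/train or dst/test with exact group sizes.

    Hypothetical helper: name and parameters are illustrative only.
    """
    items = sorted(os.listdir(src))      # sorted so a fixed seed gives the same split
    random.Random(seed).shuffle(items)   # shuffle once instead of one random draw per item

    n_test = int(len(items) * test_ratio)  # exact number of items that go to test
    groups = {"test": items[:n_test], "train": items[n_test:]}

    for name, members in groups.items():
        target_dir = os.path.join(dst, name)
        os.makedirs(target_dir, exist_ok=True)  # make sure dst/train and dst/test exist
        for item in members:
            s = os.path.join(src, item)
            d = os.path.join(target_dir, item)
            if os.path.isdir(s):
                shutil.copytree(s, d, symlinks=True)  # directories are copied recursively
            else:
                shutil.copy2(s, d)                    # plain files keep their metadata
    return groups["train"], groups["test"]
```

With 84 entries, calling `split_and_copy(src_abs, destination_data_abs, test_ratio=0.5)` would put 42 in each directory, and `test_ratio=0.25` would give 63 train / 21 test, because the cut point is computed once from the list length.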

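Your code makes an independent random draw for every file, which is like flipping a coin per file. Flipping a coin 84 times will result in a "perfect" 42 heads / 42 tails with a probability of only 0.0868.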
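
If you want to check that number yourself, it is the binomial probability of exactly 42 successes in 84 fair trials. A small sketch, assuming Python 3.8+ for `math.comb` (the variable names are just for illustration):

```python
from math import comb

n = 84  # number of files, i.e. number of coin flips

# Ways to get exactly n/2 heads, divided by all 2**n equally likely outcomes
p_even_split = comb(n, n // 2) / 2 ** n

print(round(p_even_split, 4))  # 0.0868
```

So with one draw per file, an exact 42/42 split only happens in roughly 1 run out of 12.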