mom mom - 1 year ago 57

Python script to move files to either a train dir or test dir

I am currently writing a Python script that divides my data into a train dir and a test dir. I give the script a ratio that specifies the train/test split, and the files should then be moved at random to either train or test according to that ratio.

E.g. if the ratio = 0.5, half of my dataset should end up in train and the other half in test.

Another example: if the ratio = 0.25, 75% of the dataset should end up in train and the rest in test.

But the division seems to be wrong every time. I am trying to separate 84 files/dirs and can't seem to hit the golden 42/42 separation. Any suggestions on what I could do differently?

Here is the code:

import sys
import os
import shutil
import numpy
import random


src = sys.argv[1]
destination_data = sys.argv[2]

src_abs = os.path.abspath(src)
destination_data_abs = os.path.abspath(destination_data)

src_files = os.listdir(src_abs)


def copytree(src, dst, symlinks=False, ignore=None, split=0.5):
    for item in os.listdir(src):
        s = os.path.join(src, item)
        d = os.path.join(dst, item)
        d_test = os.path.join(dst, 'test', item)
        d_train = os.path.join(dst, 'train', item)

        print(d_test)
        print(d_train)
        # One independent random draw per item decides train vs. test
        minmax = 0.0, 1.0
        rand = random.uniform(*minmax)
        print(rand)
        if rand > split:
            # Inserted into train
            if os.path.isdir(s):
                shutil.copytree(s, d_train, symlinks, ignore)
                print("Copytree used! - TRAIN")
            else:
                shutil.copy2(s, d_train)
                print("Copy 2 used! - TRAIN")
        else:
            # Inserted into test
            if os.path.isdir(s):
                shutil.copytree(s, d_test, symlinks, ignore)
                print("Copytree used! - TEST")
            else:
                shutil.copy2(s, d_test)
                print("Copy 2 used! - TEST")

copytree(src_abs, destination_data_abs, True)


The code is being executed on a Unix machine, if that matters.

Answer Source

You can take the list of files, shuffle it, then split it at an index computed from the split ratio. That way the two parts always have exactly the sizes you asked for.

import os
import numpy

src_files = os.listdir(".")
n_files = len(src_files)

split_ratio = 0.5
# Number of items in the first part
split_index = int(n_files * split_ratio)

# Shuffles the list in place
numpy.random.shuffle(src_files)

print(src_files[0:split_index])
print(src_files[split_index:])
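
For completeness, here is a rough sketch of how the shuffle-then-split idea could be combined with the copying from the question. The names `split_items` and `copy_split`, the `test_ratio` and `seed` parameters, and the choice to treat the ratio as the test fraction (so 0.25 gives 75% train) are my own assumptions based on the question's examples, not something taken from the original script. It also assumes a fresh destination, since `shutil.copytree` refuses to overwrite an existing directory by default.

```python
import os
import random
import shutil


def split_items(items, test_ratio, seed=None):
    """Shuffle a copy of items and return (train, test).

    The test part gets exactly int(len(items) * test_ratio) entries.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def copy_split(src, dst, test_ratio=0.5, seed=None):
    """Copy every file/dir in src into dst/train or dst/test (hypothetical helper)."""
    train, test = split_items(sorted(os.listdir(src)), test_ratio, seed)
    for subset, names in (("train", train), ("test", test)):
        target_dir = os.path.join(dst, subset)
        os.makedirs(target_dir, exist_ok=True)
        for name in names:
            s = os.path.join(src, name)
            d = os.path.join(target_dir, name)
            if os.path.isdir(s):
                shutil.copytree(s, d, symlinks=True)
            else:
                shutil.copy2(s, d)
    return train, test
```

Calling `copy_split(src_abs, destination_data_abs, test_ratio=0.5)` on 84 items would put 42 in each directory on every run, because the sizes come from the split index rather than from per-item random draws.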

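Your script makes an independent random draw for each file, which with a ratio of 0.5 is the same as flipping a coin 84 times. That will result in a "perfect" 42 heads / 42 tails with a probability of only 0.0868.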
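
If you want to check that figure yourself, it is the binomial probability C(84, 42) / 2^84. A small sketch (the function name `prob_exact_half` is just something I made up for illustration):

```python
import math


def prob_exact_half(n):
    """Probability that n fair coin flips give exactly n/2 heads."""
    if n % 2:
        return 0.0  # an odd number of flips can never split evenly
    # Binomial coefficient C(n, n/2) divided by the 2**n possible outcomes
    ways = math.factorial(n) // (math.factorial(n // 2) ** 2)
    return ways / 2 ** n


print(round(prob_exact_half(84), 4))
```

So with per-file random draws, an exact 42/42 split is the exception rather than the rule.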