user99889 user99889 - 4 months ago 64
Python Question

How to obtain reproducible but distinct splits from KFold and GroupKFold in scikit-learn

In the GroupKFold source, the random_state is set to None:


def __init__(self, n_splits=3):
    super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                     random_state=None)


However, when run multiple times (code from here)

import numpy as np
from sklearn.model_selection import GroupKFold

# Build the splitter from scratch ten times to see whether the splits change.
for i in range(0, 10):
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([1, 2, 3, 4])
    groups = np.array([0, 0, 2, 2])
    group_kfold = GroupKFold(n_splits=2)
    group_kfold.get_n_splits(X, y, groups)

    print(group_kfold)

    for train_index, test_index in group_kfold.split(X, y, groups):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test, y_train, y_test)
    # Two blank lines between repetitions.
    print("")
    print("")


Output:

GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
[7, 8]]), array([[1, 2],
[3, 4]]), array([3, 4]), array([1, 2]))


GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
[7, 8]]), array([[1, 2],
[3, 4]]), array([3, 4]), array([1, 2]))


etc ...

Why are the splits identical even though the random_state is None? This is the same for a larger dataset.

Further, when I use different random_states (here I show KFold, which takes random_state as an argument),

from sklearn.model_selection import KFold
import sklearn
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

print('**')

from sklearn.model_selection import KFold
import numpy as np
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]


I get the same splits

('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
**
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))


How do I set the random_state for KFold (which takes the parameter) AND for GroupKFold (which doesn't) to get different splits with each run, but also reproducible splits? If anyone can show this with the code I posted, that answers the question.

Afterword

In the meantime:

import numpy as np


class GroupShuffler:
    '''
    Shuffle X, y and groups together using a seeded RNG.

    gs = GroupShuffler(random_state=11, verbose=10)
    X, y, groups = gs.shuffle(X, y, groups)
    '''

    def __init__(self, random_state=None, verbose=0):
        self.random_state = random_state
        self.verbose = verbose

    def shuffle(self, X, y, groups):
        # Use an array: a Python 3 range object cannot be shuffled in place.
        index = np.arange(len(groups))
        rng = np.random.RandomState(self.random_state)
        rng.shuffle(index)
        if self.verbose > 0:
            print("shuffled index", index)
        return X[index], y[index], groups[index]

Answer Source
  • KFold is only randomized if shuffle=True. Some datasets should not be shuffled.
  • GroupKFold is not randomized at all. Hence the random_state=None.
  • GroupShuffleSplit may be closer to what you're looking for.
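A minimal sketch of both options, on toy data rather than the arrays from the question (the arrays, the seed values, and the helper names kfold_test_sets and group_test_sets are illustrative assumptions). A fixed random_state makes a run reproducible, and a different seed will generally give a different split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, KFold

X = np.arange(24).reshape(12, 2)
y = np.arange(12)
groups = np.repeat(np.arange(6), 2)  # six groups of two samples each


def kfold_test_sets(seed):
    """Test indices of each fold from a shuffled KFold with the given seed."""
    kf = KFold(n_splits=3, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in kf.split(X)]


def group_test_sets(seed):
    """Test indices of each independent split from GroupShuffleSplit."""
    # An integer test_size is the number of groups held out per split.
    gss = GroupShuffleSplit(n_splits=3, test_size=2, random_state=seed)
    return [test.tolist() for _, test in gss.split(X, y, groups)]


for seed in (0, 1):
    print("seed", seed)
    print("  KFold:", kfold_test_sets(seed))
    print("  GroupShuffleSplit:", group_test_sets(seed))
```

To get a different but repeatable split on each run, one option is to loop over a fixed list of seeds and record which seed produced which result.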

A comparison of the group-based splitters:

  • In GroupKFold, the test sets form a complete partition of all the data.
  • LeavePGroupsOut leaves all possible subsets of P groups out, combinatorially; test sets will overlap for P > 1. Since this means "n_groups choose P" splits altogether, often you want a small P, and most often want LeaveOneGroupOut, which is basically the same as GroupKFold with n_splits equal to the number of groups.
  • GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.

As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.
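To illustrate the point about groups versus samples, here is a sketch with deliberately unbalanced groups (the data is made up for illustration). In the stock GroupShuffleSplit a float test_size is a proportion of groups, so the number of test samples can be far from half the data:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# One large group and three single-sample groups (illustrative data).
groups = np.array([0] * 8 + [1, 2, 3])
X = np.zeros((len(groups), 1))

gss = GroupShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
test_sizes = []
for _, test in gss.split(X, groups=groups):
    n_test_groups = len(np.unique(groups[test]))
    test_sizes.append((n_test_groups, len(test)))
    print("test groups:", n_test_groups, "test samples:", len(test), "of", len(groups))
```

Each split holds out two of the four groups, so the test set has either 2 or 9 of the 11 samples, depending on whether the large group is drawn.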