user99889 - 9 months ago
Python Question

# How to obtain reproducible but distinct splits from KFold and GroupKFold in scikit-learn

In the `GroupKFold` source, `random_state` is set to `None`:

``````
def __init__(self, n_splits=3):
    super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                     random_state=None)
``````

However, when run multiple times (code from here):

``````
import numpy as np
from sklearn.model_selection import GroupKFold

for i in range(0, 10):
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([1, 2, 3, 4])
    groups = np.array([0, 0, 2, 2])
    group_kfold = GroupKFold(n_splits=2)
    group_kfold.get_n_splits(X, y, groups)

    print(group_kfold)

    for train_index, test_index in group_kfold.split(X, y, groups):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test, y_train, y_test)
    print
    print
``````

Output:

``````GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
[7, 8]]), array([[1, 2],
[3, 4]]), array([3, 4]), array([1, 2]))

GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
[7, 8]]), array([[1, 2],
[3, 4]]), array([3, 4]), array([1, 2]))
``````

etc ...

Why are the splits identical even though `random_state` is `None`? The same happens with a larger dataset.

Further, when I use different `random_state`s (here I show for `KFold`, which takes a `random_state` as an argument),

``````
from sklearn.model_selection import KFold
import sklearn
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

print '**'

from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
``````

I get the same splits:

``````('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
**
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
``````

How do I set the `random_state` for `KFold` (which takes the parameter) AND for `GroupKFold` (which doesn't) to get different splits with each run, but also reproducible splits? If anyone can show this with the code I posted, that answers the question.
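For `KFold`, the randomness only kicks in with `shuffle=True`; `random_state` then pins it down. A minimal sketch on the same toy data (the seed values are arbitrary):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

# shuffle=True randomizes the fold assignment; random_state makes it
# reproducible.  Different seeds may give different splits.
for seed in (0, 1):
    kf = KFold(n_splits=2, shuffle=True, random_state=seed)
    for train_index, test_index in kf.split(X):
        print("seed", seed, "TRAIN:", train_index, "TEST:", test_index)

# The same seed reproduces exactly the same splits on every run.
first = [tuple(test) for _, test in
         KFold(n_splits=2, shuffle=True, random_state=0).split(X)]
second = [tuple(test) for _, test in
         KFold(n_splits=2, shuffle=True, random_state=0).split(X)]
assert first == second
```

To get splits that differ per run yet stay reproducible, one option is to draw a seed once per run, log it, and pass it as `random_state`.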

Afterword

In the meantime, here is the workaround I am using:

``````
import numpy as np

class GroupShuffler:
    '''
    gs = GroupShuffler(random_state=11, verbose=10)
    gs.shuffle(X, y, groups)
    '''

    def __init__(self, random_state=None, verbose=0):
        self.random_state = random_state
        self.verbose = verbose

    def shuffle(self, X, y, groups):
        # np.arange rather than range: the shuffled index is used for
        # NumPy fancy indexing (and range is not shufflable on Python 3)
        index = np.arange(len(groups))
        rng = np.random.RandomState(self.random_state)
        rng.shuffle(index)
        if self.verbose > 0:
            print("shuffled index", index)
        return X[index], y[index], groups[index]
``````
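The idea is to shuffle the rows with a chosen seed before handing them to the deterministic `GroupKFold`: a fixed seed reproduces the splits, a different seed can change them. A self-contained sketch of the same trick (the helper name is mine):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])

def shuffled_group_kfold(X, y, groups, n_splits, seed):
    # Shuffle rows with a seeded RNG, then run the (deterministic)
    # GroupKFold on the shuffled data.
    index = np.arange(len(groups))
    np.random.RandomState(seed).shuffle(index)
    X, y, groups = X[index], y[index], groups[index]
    return list(GroupKFold(n_splits=n_splits).split(X, y, groups))

for train_index, test_index in shuffled_group_kfold(X, y, groups, 2, seed=11):
    print("TRAIN:", train_index, "TEST:", test_index)
```

Note the returned indices refer to the shuffled arrays, so in practice you would also return the shuffled `X`, `y`, `groups` (as `GroupShuffler.shuffle` does).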

• `KFold` is only randomized if `shuffle=True`. Some datasets should not be shuffled.
• `GroupKFold` is not randomized at all. Hence the `random_state=None`.
• `GroupShuffleSplit` may be closer to what you're looking for.
• In `GroupKFold`, the test sets form a complete partition of all the data.
• `LeavePGroupsOut` leaves all possible subsets of P groups out, combinatorially; test sets will overlap for P > 1. Since this means `comb(n_groups, P)` splits altogether, you often want a small P, and most often want `LeaveOneGroupOut`, which is basically the same as `GroupKFold` with `n_splits` equal to the number of groups.
• `GroupShuffleSplit` makes no statement about the relationship between successive test sets; each train/test split is performed independently.
As an aside, Dmytro Lituiev has proposed an alternative `GroupShuffleSplit` algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified `test_size`.
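Since `GroupShuffleSplit` accepts a `random_state` directly, it gives randomized group-wise splits that are reproducible for a fixed seed; a short sketch on the toy data from the question (seed and `test_size` are arbitrary):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])

# test_size=0.5 holds out half the groups; random_state fixes which half,
# so reruns with the same seed reproduce the same splits.
gss = GroupShuffleSplit(n_splits=2, test_size=0.5, random_state=42)
for train_index, test_index in gss.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    # every group lands wholly in train or wholly in test
    assert set(groups[train_index]).isdisjoint(groups[test_index])
```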