user99889 - 7 months ago 99
Python Question

# How to obtain reproducible but distinct instances of GroupKFold

In the `GroupKFold` source, the `random_state` is set to `None`:

``````
    def __init__(self, n_splits=3):
        super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                         random_state=None)
``````

Hence, when the following is run multiple times (code from here):

``````
import numpy as np
from sklearn.model_selection import GroupKFold

for i in range(0, 10):
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([1, 2, 3, 4])
    groups = np.array([0, 0, 2, 2])
    group_kfold = GroupKFold(n_splits=2)
    group_kfold.get_n_splits(X, y, groups)

    print(group_kfold)

    for train_index, test_index in group_kfold.split(X, y, groups):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test, y_train, y_test)
    # Blank lines between repetitions (a bare `print` only does this in Python 2).
    print("")
    print("")
``````

Output (Python 2):

``````
GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
       [3, 4]]), array([[5, 6],
       [7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
       [7, 8]]), array([[1, 2],
       [3, 4]]), array([3, 4]), array([1, 2]))

GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
       [3, 4]]), array([[5, 6],
       [7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
       [7, 8]]), array([[1, 2],
       [3, 4]]), array([3, 4]), array([1, 2]))
``````

etc ...

The splits are identical.

How do I set a `random_state` for `GroupKFold` in order to get a different (but reproducible) set of splits over a few different trials of cross-validation?

E.g., I want:

``````
GroupKFold(n_splits=2, random_state=42)
('TRAIN:', array([0, 1]),
'TEST:', array([2, 3]))

('TRAIN:', array([2, 3]),
'TEST:', array([0, 1]))

GroupKFold(n_splits=2, random_state=13)
('TRAIN:', array([0, 2]),
'TEST:', array([1, 3]))

('TRAIN:', array([1, 3]),
'TEST:', array([0, 2]))
``````

So far, it seems a good strategy is to use `sklearn.utils.shuffle` first, as suggested in this post.
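One way to get seed-controlled folds that still keep every group intact is to shuffle the unique group labels rather than the rows, and then map the chosen groups back to sample indices. The sketch below is only an illustration of that idea, not code from the linked post: `shuffled_group_kfold` and its `seed` argument are hypothetical names, and it leans on `KFold(shuffle=True, random_state=...)` to do the seeded permutation.

```python
import numpy as np
from sklearn.model_selection import KFold


def shuffled_group_kfold(groups, n_splits, seed):
    """Yield (train_idx, test_idx) pairs that never split a group.

    Hypothetical helper: folds are built over the unique group labels
    with a seeded, shuffled KFold, then mapped back to sample indices.
    """
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for _, test_g in kf.split(unique_groups):
        # Every sample whose group landed in the test fold goes to the test set.
        test_mask = np.isin(groups, unique_groups[test_g])
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Illustrative data: four groups of two samples each.
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

for seed in (42, 13):
    print("seed =", seed)
    for train_index, test_index in shuffled_group_kfold(groups, 2, seed):
        print("TRAIN:", train_index, "TEST:", test_index)
```

Re-running with the same seed gives the same folds, and the test sets still partition the data as they do in `GroupKFold`. Unlike `GroupKFold`, this sketch makes no attempt to balance fold sizes when groups differ in size.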

• `KFold` is only randomized if `shuffle=True`. Some datasets should not be shuffled.
• `GroupKFold` is not randomized at all. Hence the `random_state=None`.
• `GroupShuffleSplit` may be closer to what you're looking for.
• In `GroupKFold`, the test sets form a complete partition of all the data.
• `LeavePGroupsOut` leaves all possible subsets of P groups out, combinatorially; test sets will overlap for P > 1. Since this means `n_groups` choose `P` splits altogether (roughly `n_groups ** P` for small P), often you want a small P, and most often want `LeaveOneGroupOut`, which is basically the same as `GroupKFold` with `n_splits` equal to the number of groups.
• `GroupShuffleSplit` makes no statement about the relationship between successive test sets; each train/test split is performed independently.
As an aside, Dmytro Lituiev has proposed an alternative `GroupShuffleSplit` algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified `test_size`.
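For the "different but reproducible" requirement in the question, `GroupShuffleSplit` already accepts a `random_state`. A minimal sketch, assuming toy data with four groups and a hypothetical helper `splits_for`:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 8 samples in 4 groups of 2.
X = np.arange(16).reshape(8, 2)
y = np.arange(8)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])


def splits_for(seed):
    """Return the train/test index lists produced for a given seed."""
    # test_size is a proportion of *groups*, so 0.5 holds out 2 of the 4 groups.
    gss = GroupShuffleSplit(n_splits=3, test_size=0.5, random_state=seed)
    return [(train.tolist(), test.tolist())
            for train, test in gss.split(X, y, groups)]


for seed in (42, 13):
    print("random_state =", seed)
    for train_index, test_index in splits_for(seed):
        print("TRAIN:", train_index, "TEST:", test_index)
```

The same seed always reproduces the same sequence of splits. Because each split is drawn independently, the test sets within one run may overlap and need not cover all the data, which is the main difference from `GroupKFold`.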