user99889 - 2 months ago 27

Python Question

In the `GroupKFold` source, `random_state` is set to `None`:

    def __init__(self, n_splits=3):
        super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                         random_state=None)

Hence, when run multiple times (code from here)

    import numpy as np
    from sklearn.model_selection import GroupKFold

    for i in range(0,10):
        X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
        y = np.array([1, 2, 3, 4])
        groups = np.array([0, 0, 2, 2])
        group_kfold = GroupKFold(n_splits=2)
        group_kfold.get_n_splits(X, y, groups)
        print(group_kfold)
        for train_index, test_index in group_kfold.split(X, y, groups):
            print("TRAIN:", train_index, "TEST:", test_index)
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]
            print(X_train, X_test, y_train, y_test)

Output:

    GroupKFold(n_splits=2)
    ('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
    (array([[1, 2],
           [3, 4]]), array([[5, 6],
           [7, 8]]), array([1, 2]), array([3, 4]))
    ('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
    (array([[5, 6],
           [7, 8]]), array([[1, 2],
           [3, 4]]), array([3, 4]), array([1, 2]))
    GroupKFold(n_splits=2)
    ('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
    (array([[1, 2],
           [3, 4]]), array([[5, 6],
           [7, 8]]), array([1, 2]), array([3, 4]))
    ('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
    (array([[5, 6],
           [7, 8]]), array([[1, 2],
           [3, 4]]), array([3, 4]), array([1, 2]))

etc ...

The splits are identical.

How do I set a `random_state` for `GroupKFold`?

E.g., I want:

    GroupKFold(n_splits=2, random_state=42)
    ('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
    ('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))

    GroupKFold(n_splits=2, random_state=13)
    ('TRAIN:', array([0, 2]), 'TEST:', array([1, 3]))
    ('TRAIN:', array([1, 3]), 'TEST:', array([0, 2]))

So far, it seems a good strategy is to use `sklearn.utils.shuffle`.

Answer Source

`KFold` is only randomized if `shuffle=True`. Some datasets should not be shuffled. `GroupKFold` is not randomized at all, hence the `random_state=None`. `GroupShuffleSplit` may be closer to what you're looking for.
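As a rough illustration of that suggestion, here is a minimal sketch of `GroupShuffleSplit` with an explicit `random_state`. The six-sample dataset and the helper name `group_shuffle_splits` are made up for this example, and `test_size=1` is an assumption meaning "hold out one group per split". The same seed should reproduce the same splits, and a group never appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data (hypothetical): three groups of two samples each.
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([0, 0, 2, 2, 3, 3])


def group_shuffle_splits(seed, n_splits=3):
    """Return a list of (train_indices, test_indices) for the given seed."""
    # An integer test_size is interpreted as a number of groups, not samples.
    gss = GroupShuffleSplit(n_splits=n_splits, test_size=1, random_state=seed)
    return [(train.tolist(), test.tolist())
            for train, test in gss.split(X, y, groups)]


for seed in (42, 13):
    print("random_state =", seed)
    for train_index, test_index in group_shuffle_splits(seed):
        print("  TRAIN:", train_index, "TEST:", test_index)
```

Note that, unlike `GroupKFold`, the test sets produced this way are drawn independently, so they are not guaranteed to cover every group exactly once.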

A comparison of the group-based splitters:

- In `GroupKFold`, the test sets form a complete partition of all the data.
- `LeavePGroupsOut` leaves all possible subsets of P groups out, combinatorially; test sets will overlap for P > 1. Since this means `n_groups choose P` splits altogether, often you want a small P, and most often want `LeaveOneGroupOut`, which is basically the same as `GroupKFold` with `n_splits` equal to the number of groups.
- `GroupShuffleSplit` makes no statement about the relationship between successive test sets; each train/test split is performed independently.

As an aside, Dmytro Lituiev has proposed an alternative `GroupShuffleSplit` algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified `test_size`.