user99889 - 2 months ago 32

Python Question

In the `GroupKFold` source, `random_state` is set to `None`:

```python
def __init__(self, n_splits=3):
    super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                     random_state=None)
```

However, when I run the following multiple times (code from here):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

for i in range(0, 10):
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([1, 2, 3, 4])
    groups = np.array([0, 0, 2, 2])
    group_kfold = GroupKFold(n_splits=2)
    group_kfold.get_n_splits(X, y, groups)
    print(group_kfold)
    for train_index, test_index in group_kfold.split(X, y, groups):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test, y_train, y_test)
```

Output:

```
GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
       [3, 4]]), array([[5, 6],
       [7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
       [7, 8]]), array([[1, 2],
       [3, 4]]), array([3, 4]), array([1, 2]))
GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
       [3, 4]]), array([[5, 6],
       [7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
       [7, 8]]), array([[1, 2],
       [3, 4]]), array([3, 4]), array([1, 2]))
etc ...
```

Why are the splits identical even though `random_state` is `None`?

Further, when I run `KFold` twice without fixing `random_state`:

```python
from sklearn.model_selection import KFold
import sklearn
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Separator between the two runs (function-call form works on Python 2 and 3)
print('**')

from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

I get the same splits:

```
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
**
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
```

How do I set the `random_state` for `KFold` and `GroupKFold`?

In the meantime:

```python
import numpy as np


class GroupShuffler:
    '''
    gs = GroupShuffler(random_state=11, verbose=10)
    gs.shuffle(X, y, groups)
    '''

    def __init__(self, random_state=None, verbose=0):
        self.random_state = random_state
        self.verbose = verbose

    def shuffle(self, X, y, groups):
        # Use an integer array rather than range() so it can be shuffled
        # in place (range objects are immutable on Python 3).
        index = np.arange(len(groups))
        rng = np.random.RandomState(self.random_state)
        rng.shuffle(index)
        if self.verbose > 0:
            print("shuffled index", index)
        # Apply the same permutation to samples, targets and group labels
        return X[index], y[index], groups[index]
```

Answer Source

`KFold` is only randomized if `shuffle=True`. Some datasets should not be shuffled. `GroupKFold` is not randomized at all. Hence the `random_state=None`. `GroupShuffleSplit` may be closer to what you're looking for.
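As a minimal sketch of the point above (the toy arrays, seed values, and helper names are illustrative assumptions, not taken from the question): `KFold` only uses `random_state` once `shuffle=True` is set, while `GroupShuffleSplit` accepts `random_state` directly and keeps each group on one side of the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, KFold

# Toy data: 8 samples in 4 groups of 2 (made up for illustration)
X = np.arange(16).reshape(8, 2)
y = np.arange(8)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])


def kfold_test_sets(seed):
    """Test indices of each fold for a shuffled KFold with the given seed."""
    kf = KFold(n_splits=2, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in kf.split(X)]


def group_shuffle_splits(seed):
    """(train, test) index lists for a seeded GroupShuffleSplit."""
    gss = GroupShuffleSplit(n_splits=3, test_size=0.5, random_state=seed)
    return [(train.tolist(), test.tolist())
            for train, test in gss.split(X, y, groups)]


# Re-running with the same seed reproduces the same splits
print(kfold_test_sets(0) == kfold_test_sets(0))
print(group_shuffle_splits(0) == group_shuffle_splits(0))

# No group ever appears in both the train and the test side
for train, test in group_shuffle_splits(0):
    print(set(groups[train]) & set(groups[test]))
```

Changing the seed will usually, though not necessarily, give different splits.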

A comparison of the group-based splitters:

- In `GroupKFold`, the test sets form a complete partition of all the data.
- `LeavePGroupsOut` leaves all possible subsets of P groups out, combinatorially; test sets will overlap for P > 1. Since this means `n_groups choose P` splits altogether, often you want a small P, and most often want `LeaveOneGroupOut`, which is basically the same as `GroupKFold` with `n_splits` equal to the number of groups.
- `GroupShuffleSplit` makes no statement about the relationship between successive test sets; each train/test split is performed independently.
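To make the comparison concrete, here is a small sketch on made-up data (the arrays and group labels are illustrative assumptions). It checks that the `GroupKFold` test sets cover every sample exactly once, counts the splits produced by `LeavePGroupsOut`, and compares the test sets of `LeaveOneGroupOut` with those of `GroupKFold` using one fold per group.

```python
from math import comb

import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, LeavePGroupsOut

# Toy data: 8 samples in 4 equally sized groups (made up for illustration)
X = np.arange(16).reshape(8, 2)
y = np.arange(8)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
n_groups = len(np.unique(groups))

# GroupKFold: the test sets together contain every sample exactly once
gkf_tests = [test for _, test in GroupKFold(n_splits=2).split(X, y, groups)]
covered = np.sort(np.concatenate(gkf_tests))
print("GroupKFold test sets cover:", covered)

# LeavePGroupsOut with P=2: one split per pair of groups
n_lpgo = LeavePGroupsOut(n_groups=2).get_n_splits(X, y, groups)
print("LeavePGroupsOut splits:", n_lpgo, "expected:", comb(n_groups, 2))

# LeaveOneGroupOut vs GroupKFold with as many folds as groups
# (compared as sets, since the fold order is not guaranteed to match)
logo_tests = {frozenset(test.tolist())
              for _, test in LeaveOneGroupOut().split(X, y, groups)}
gkf_full_tests = {frozenset(test.tolist())
                  for _, test in GroupKFold(n_splits=n_groups).split(X, y, groups)}
print("Same test sets:", logo_tests == gkf_full_tests)
```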

As an aside, Dmytro Lituiev has proposed an alternative `GroupShuffleSplit` algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified `test_size`.
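The sketch below is not the proposed alternative algorithm. It only illustrates, on deliberately unbalanced made-up groups, the behaviour that proposal targets: in the stock `GroupShuffleSplit`, a float `test_size` is a proportion of groups, so the share of samples in the test set can be far from it.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# One large group (6 samples) and three singleton groups (made up for illustration)
groups = np.array([0, 0, 0, 0, 0, 0, 1, 2, 3])
X = np.zeros((len(groups), 1))

# test_size=0.25 of 4 groups means exactly one group per test set
gss = GroupShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

summary = []
for train_idx, test_idx in gss.split(X, groups=groups):
    n_test_groups = len(np.unique(groups[test_idx]))
    n_test_samples = len(test_idx)
    summary.append((n_test_groups, n_test_samples))
    print("test groups:", n_test_groups,
          "test samples:", n_test_samples,
          "sample fraction: %.2f" % (n_test_samples / len(groups)))
```

Each test set holds one group, so it contains either 1 or 6 of the 9 samples, and never the 25% of samples that `test_size=0.25` might suggest.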