Chris F. - 2 months ago
Python Question

Stratified Labeled K-Fold Cross-Validation In Scikit-Learn

I'm trying to classify instances of a dataset as being in one of two classes, a or b. Class b is a minority class that makes up only 8% of the dataset. Every instance is assigned an id indicating which subject generated the data. Because each subject generated multiple instances, ids are repeated frequently in the dataset.

The table below is just an example; the real table has about 100,000 instances, with roughly 100 instances per subject id. Every subject is tied to exactly one class, as you can see with "larry" below.

|     | field | field | id    | class |
|-----|-------|-------|-------|-------|
| 0   | _     | _     | bob   | a     |
| 1   | _     | _     | susan | a     |
| 2   | _     | _     | susan | a     |
| 3   | _     | _     | bob   | a     |
| 4   | _     | _     | larry | b     |
| 5   | _     | _     | greg  | a     |
| 6   | _     | _     | larry | b     |
| 7   | _     | _     | bob   | a     |
| 8   | _     | _     | susan | a     |
| 9   | _     | _     | susan | a     |
| 10  | _     | _     | bob   | a     |
| 11  | _     | _     | greg  | a     |
| ... | ...   | ...   | ...   | ...   |


I would like to use cross-validation to tune the model, and I must stratify the dataset so that each fold contains a few examples of the minority class b. The problem is that I have a second constraint: the same id must never appear in two different folds, as this would leak information about the subject.

I'm using Python's scikit-learn library. I need a method that combines LabelKFold, which makes sure labels (ids) are not split among folds, with StratifiedKFold, which makes sure every fold has a similar ratio of classes. How can I accomplish this with scikit-learn? If it is not possible to split on two constraints in sklearn, how can I effectively split the dataset by hand or with other Python libraries?

Answer

The following is a bit tricky with respect to indexing (it helps to use something like Pandas for it), but conceptually it is simple.

Suppose you build a dummy dataset containing only the id and class columns, and then remove duplicate id entries from it.
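As a minimal sketch with pandas (using the id and class column names from the example table above), the dummy dataset can be built like this:

```python
import pandas as pd

# Toy version of the example table: each id belongs to exactly one class.
df = pd.DataFrame({
    "id":    ["bob", "susan", "susan", "bob", "larry", "greg", "larry"],
    "class": ["a",   "a",     "a",     "a",   "b",     "a",    "b"],
})

# Dummy dataset: one row per unique id, keeping its class.
dummy = df[["id", "class"]].drop_duplicates().reset_index(drop=True)
print(dummy)
```

Because every id maps to exactly one class, dropping duplicates loses no information: the dummy dataset is simply the id-to-class mapping.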

For your cross validation, run stratified cross validation on the dummy dataset. At each iteration:

  1. Find out which ids were selected for the train set and which for the test set.

  2. Go back to the original dataset, and place every instance belonging to each id into the train or test set accordingly.
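Put together, the procedure might look like the following sketch, which assumes a pandas DataFrame `df` with id and class columns as in the example table, and uses scikit-learn's `StratifiedKFold` on the deduplicated ids:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: every id belongs to exactly one class.
df = pd.DataFrame({
    "id":    ["bob", "susan", "susan", "bob", "larry",
              "greg", "larry", "bob", "ann", "ann"],
    "class": ["a", "a", "a", "a", "b", "a", "b", "a", "b", "b"],
})

# Dummy dataset: one row per unique id.
dummy = df[["id", "class"]].drop_duplicates()

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(dummy["id"], dummy["class"]):
    # Step 1: which ids went to train and which to test.
    train_ids = set(dummy["id"].iloc[train_idx])
    test_ids = set(dummy["id"].iloc[test_idx])

    # Step 2: map back, sending every instance of an id to the same side.
    train = df[df["id"].isin(train_ids)]
    test = df[df["id"].isin(test_ids)]

    # No id ever appears in both folds.
    assert train_ids.isdisjoint(test_ids)
```

Since the stratification runs over the deduplicated ids, the class ratio is balanced at the subject level rather than the instance level; with roughly 100 instances per subject, the instance-level ratios should come out similar.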

This works because:

  1. As you stated, each id is associated with a single class.

  2. Since we run stratified CV, each class is represented proportionally.

  3. Since each id appears in only the train set or the test set (but not both), the split respects the id constraint as well.
