Simon Simon - 1 month ago 18
Python Question

sklearn - model keeps overfitting

I'm looking for recommendations on the best way forward for my current machine learning problem.

The outline of the problem and what I've done is as follows:


  • I have 900+ trials of EEG data, where each trial is 1 second long. The ground truth is known for each trial and labels it as state 0 or state 1 (40-60% split)

  • Each trial goes through preprocessing where I filter and extract power of certain frequency bands, and these make up a set of features (feature matrix: 913x32)

  • Then I use sklearn to train the model. I split the data with cross_validation, using a test size of 0.2. The classifier is an SVC with an rbf kernel, C = 1, gamma = 1 (I've tried a number of different values)



You can find a shortened version of the code here: http://pastebin.com/Xu13ciL4

My issues:


  • When I use the classifier to predict labels for my test set, every prediction is 0

  • train accuracy is 1, while test set accuracy is around 0.56

  • my learning curve plot looks like this:



[image: learning curve plot]

Now, this seems like a classic case of overfitting. However, it is unlikely to be caused by a disproportionate number of features relative to samples (32 features, 900 samples). I've tried a number of things to alleviate this problem:


  • I've tried using dimensionality reduction (PCA) in case I have too many features for the number of samples, but the accuracy scores and learning curve plot look the same as above. The exception is when I set the number of components below 10, at which point train accuracy begins to drop, but is this not somewhat expected given that you're beginning to lose information?

  • I have tried normalizing and standardizing the data. Standardizing (SD = 1) does nothing to change the train or test accuracy scores. Normalizing (0-1) drops my training accuracy to 0.6.

  • I've tried a variety of C and gamma settings for SVC, but they don't change either score

  • Tried using other estimators like GaussianNB, and even ensemble methods like AdaBoost. No change

  • Tried explicitly setting a regularization method using LinearSVC, but it didn't improve the situation

  • I tried running the same features through a neural net using theano and my train accuracy is around 0.6, test is around 0.5



I'm happy to keep thinking about the problem but at this point I'm looking for a nudge in the right direction. Where might my problem be and what could I do to solve it?

It's entirely possible that my set of features just doesn't distinguish between the 2 categories, but I'd like to try some other options before jumping to this conclusion. Furthermore, if my features don't distinguish between them, that would explain the low test set scores, but how do you get a perfect training set score in that case? Is that possible?

Answer

I would first try a grid search over the parameter space, using k-fold cross-validation on the training set (and keeping the test set to the side, of course). Then pick the set of parameters that generalizes best according to the k-fold cross-validation. I suggest using GridSearchCV with StratifiedKFold (it's already the default strategy for GridSearchCV when passing a classifier as the estimator).
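
A minimal sketch of that grid search, assuming a recent scikit-learn where these classes live in sklearn.model_selection. The data is a synthetic stand-in with the same 913x32 shape (not the EEG features from the question), and the C and gamma grids are illustrative guesses rather than recommended values:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 913x32 feature matrix (assumption: NOT real EEG data).
# Only feature 0 carries signal; class 1 ends up being roughly 40% of the labels.
rng = np.random.RandomState(0)
X = rng.randn(913, 32)
y = (X[:, 0] + 0.5 * rng.randn(913) > 0.25).astype(int)

# Keep the test set to the side; stratify so both splits keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative grid (assumption); widen or shift it for real data.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}

# Explicit StratifiedKFold; passing cv=5 with a classifier would also stratify.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("mean CV accuracy of best params:", search.best_score_)

# Touch the held-out test set only once, after the parameters are chosen.
test_score = search.score(X_test, y_test)
print("test accuracy:", test_score)
```

Comparing search.best_score_ with the test accuracy is a quick check on whether the chosen parameters generalize or whether the cross-validation estimate itself is optimistic.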

Hypothetically, an SVM with an rbf kernel can perfectly fit any training set, because its VC dimension is infinite. That is why a perfect training score is possible even when the features carry little signal. So if tuning the parameters doesn't help reduce overfitting, you may want to try a similar parameter tuning strategy for a simpler hypothesis, such as a linear SVM or another classifier you think may be appropriate for your domain.
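
To illustrate the point about perfect training fits, here is a small sketch on pure noise. Everything in it is an assumption made for the demo: the features are random numbers, the labels are assigned at random with a roughly 40/60 split, and C = 1, gamma = 1 are simply the values mentioned in the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Pure noise: features and labels are unrelated by construction (assumption for the demo).
rng = np.random.RandomState(0)
X = rng.randn(913, 32)
y = (rng.rand(913) < 0.4).astype(int)  # random labels, roughly 40% ones

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# With gamma = 1 on 32 unit-variance features, the rbf kernel between two
# different samples is close to zero, so each training point is effectively
# memorized on its own.
clf = SVC(kernel="rbf", C=1, gamma=1).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
predicted_classes = np.unique(clf.predict(X_test))

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
print("classes predicted on the test set:", predicted_classes)
```

On noise like this the training score typically comes out at or near 1.0, while the test score stays around the majority-class rate, which is the same pattern described in the question.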

Regularization, as you mentioned, is definitely a good idea if it's available.

The prediction of the same label makes me think that label imbalance may be an issue, and for this case you could use different class weights. In the case of an SVM, each class then gets its own C penalty weight. Some estimators in sklearn also accept fit params that let you set sample weights, which control the amount of penalty for individual training samples.
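
Both weighting options can be sketched as follows. The data is again a synthetic stand-in (an assumption, not the EEG features), and C = 1, gamma = 0.01 are placeholder values:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in data (assumption): class 1 is the minority at roughly 40%.
rng = np.random.RandomState(0)
X = rng.randn(913, 32)
y = (X[:, 0] + 0.5 * rng.randn(913) > 0.25).astype(int)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the rarer class gets the larger weight.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))

# Option 1: per-class weights, which scale C separately for each class.
clf_class = SVC(kernel="rbf", C=1, gamma=0.01, class_weight="balanced")
clf_class.fit(X, y)

# Option 2: per-sample weights passed to fit(). Here each sample just gets
# its class weight, but any per-sample values could be used instead.
sample_weight = weights[y]  # works because the labels are 0 and 1
clf_sample = SVC(kernel="rbf", C=1, gamma=0.01)
clf_sample.fit(X, y, sample_weight=sample_weight)

pred_class = clf_class.predict(X)
pred_sample = clf_sample.predict(X)
```

class_weight is the simpler choice when the only concern is class imbalance; sample_weight is more flexible, for example if some trials are known to be noisier than others.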

Now, if you think the features may be an issue, I would use feature selection by looking at the F-values provided by f_classif, which can be used with something like SelectKBest. Another option would be recursive feature elimination with cross-validation. Feature selection can be wrapped into a grid search as well if you use sklearn's Pipeline API.
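
A rough sketch of feature selection wrapped into a grid search through a Pipeline. The step names "select" and "clf", the grids, and the synthetic data (where only features 0 and 1 are informative) are all assumptions made for illustration:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): only features 0 and 1 carry signal.
rng = np.random.RandomState(0)
X = rng.randn(913, 32)
y = (X[:, 0] - X[:, 1] + 0.5 * rng.randn(913) > 0.3).astype(int)

# Selection lives inside the pipeline, so it is refit on each training fold
# and never sees the validation fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", SVC(kernel="linear")),
])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "select__k": [2, 4, 8, 16, 32],
    "clf__C": [0.01, 0.1, 1],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

# Boolean mask of the features kept by the best pipeline.
selected = search.best_estimator_.named_steps["select"].get_support()
print("best params:", search.best_params_)
print("selected feature indices:", np.flatnonzero(selected))
```

RFECV from sklearn.feature_selection could be tried in place of SelectKBest for the recursive elimination variant; it picks the number of features by cross-validation itself, so it would not need the select__k grid.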
