WhitneyChia - 3 months ago
Python Question

NaNs suddenly appearing for sklearn KFolds

I'm trying to run cross validation on my data set. The data appears to be clean, but then when I try to run it, some of my data gets replaced by NaNs. I'm not sure why. Has anybody seen this before?

import numpy as np
from sklearn import cross_validation as cv

y, X = np.ravel(df_test['labels']), df_test[['variation', 'length', 'tempo']]
X_train, X_test, y_train, y_test = cv.train_test_split(X, y, test_size=.30, random_state=4444)


This is what my X data looked like before KFolds:

variation length tempo
0 0.005144 1183.148118 135.999178
1 0.002595 720.165442 117.453835
2 0.008146 397.500952 112.347147
3 0.005367 1109.819501 172.265625
4 0.001631 509.931973 135.999178
5 0.001620 560.365714 151.999081
6 0.002513 763.377778 107.666016
7 0.009262 502.083628 99.384014
8 0.000610 500.017052 143.554688
9 0.000733 269.001723 117.453835


My Y data looks like this:

array([ True, False, False, True, True, True, True, False, True, False], dtype=bool)
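
For what it's worth, a quick way to confirm the frame really is NaN-free before the split (values re-typed by hand from the first rows of the table above):

```python
import pandas as pd

# First rows of the frame from the question, re-entered for illustration.
df_test = pd.DataFrame({
    "variation": [0.005144, 0.002595, 0.008146],
    "length": [1183.148118, 720.165442, 397.500952],
    "tempo": [135.999178, 117.453835, 112.347147],
})

# True would mean at least one NaN somewhere in the frame.
print(df_test.isnull().values.any())  # False
```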


Now when I try to do the cross val:

kf = KFold(X_train.shape[0], n_folds=4, shuffle=True)

for train_index, val_index in kf:
    cv_train_x = X_train.ix[train_index]
    cv_val_x = X_train.ix[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x

    logreg = LogisticRegression(C=.01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)


When I try to run this, it errors out with the error below, so I added the print statement to investigate.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').


This is what the print statement showed: some of the data had become NaNs.

variation length tempo
0 NaN NaN NaN
1 NaN NaN NaN
2 0.008146 397.500952 112.347147
3 0.005367 1109.819501 172.265625
4 0.001631 509.931973 135.999178


I'm sure I'm doing something wrong. Any ideas? As always, thank you so much!

Answer

To solve this, use .iloc instead of .ix to index your pandas DataFrame:

for train_index, val_index in kf:
    cv_train_x = X_train.iloc[train_index]
    cv_val_x = X_train.iloc[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x

    logreg = LogisticRegression(C = .01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)

Indexing with .ix is usually equivalent to using .loc, which is label-based indexing, not position-based. .loc happens to work on X because X has a clean integer index where labels and positions coincide, but after the shuffled train/test split that is no longer true, and X_train looks something like:

        length       tempo  variation
4   509.931973  135.999178   0.001631
2   397.500952  112.347147   0.008146
7   502.083628   99.384014   0.009262
6   763.377778  107.666016   0.002513
5   560.365714  151.999081   0.001620
3  1109.819501  172.265625   0.005367
9   269.001723  117.453835   0.000733

Now labels 0 and 1 no longer exist, so if you do

X_train.loc[1]

you will get an Exception

KeyError: 'the label [1] is not in the [index]'
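
A small self-contained reproduction of the label-vs-position distinction (the frame and its index values here are made up for illustration):

```python
import pandas as pd

# Hypothetical slice of X_train after a shuffled split: positions 0..2,
# but the surviving labels are 4, 2 and 7.
X_train = pd.DataFrame(
    {"variation": [0.001631, 0.008146, 0.009262]},
    index=[4, 2, 7],
)

# Label-based lookup: label 1 is gone, so this raises KeyError.
try:
    X_train.loc[1]
except KeyError:
    print("label 1 is not in the index")

# Position-based lookup: position 1 is the row labelled 2.
print(X_train.iloc[1].name)  # 2
```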

However, pandas fails silently if you request multiple labels and at least one of them exists (note that newer pandas versions raise a KeyError here too). Thus if you do

X_train.loc[[1,4]]

you will get

       length       tempo  variation
1         NaN         NaN        NaN
4  509.931973  135.999178   0.001631

As expected, 1 returns NaNs (since it was not found) and 4 is an actual row, since it exists in X_train's index. To solve it, just switch to .iloc or manually rebuild the index of X_train (for example with X_train.reset_index(drop=True)).
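
Both fixes can be sketched in one self-contained example. The frame and labels are made up, and it uses the modern KFold from sklearn.model_selection, since the KFold(n, n_folds=4) signature in the question comes from the old cross_validation module:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold  # modern home of KFold

# Stand-in for X_train after a shuffled split: gappy, out-of-order labels.
X_train = pd.DataFrame(
    {"variation": [0.001631, 0.008146, 0.009262, 0.000733]},
    index=[4, 2, 7, 9],
)
y_train = np.array([True, False, True, False])

# Fix 1: index positionally with .iloc -- fold indices are positions 0..n-1,
# so this never produces NaNs regardless of the labels.
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train_index, val_index in kf.split(X_train):
    cv_train_x = X_train.iloc[train_index]
    cv_train_y = y_train[train_index]

# Fix 2: rebuild a clean 0..n-1 index once, so labels equal positions again.
X_train = X_train.reset_index(drop=True)
print(list(X_train.index))  # [0, 1, 2, 3]
```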