Anil Narassiguin - 9 months ago
Python Question

Scikit: calculate precision and recall using cross_val_score function

I'm using scikit-learn to fit a logistic regression on spam/ham data.
X_train is my training data and y_train holds the labels ('spam' or 'ham'). I trained my LogisticRegression this way:

classifier = LogisticRegression().fit(X_train, y_train)

If I want to get the accuracies for a 10-fold cross-validation, I just write:

accuracy = cross_val_score(classifier, X_train, y_train, cv=10)

I thought it was also possible to calculate the precisions and recalls by simply adding one parameter, this way:

precision = cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')

But this raises:

ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4')

Is it related to the data (should I binarize the labels?) or did the function change?

Thank you in advance!


To compute the recall and precision, the labels do indeed have to be binarized, like this:

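from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
# fit_transform returns a column vector for two classes; ravel() flattens it to 1-D
y_train_bin = lb.fit_transform(y_train).ravel()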
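
As a rough end-to-end sketch of the binarize-then-score approach, the snippet below runs the precision and recall cross-validation on binarized labels. The synthetic data from make_classification is only a stand-in for the real spam/ham features, the imports assume a recent scikit-learn where cross_val_score lives in sklearn.model_selection, and the names y_train_bin, precision_spam and precision_alt are made up for illustration. It also shows one possible alternative that keeps the string labels and names the positive class through make_scorer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelBinarizer

# Synthetic stand-in for the spam/ham features (assumption: the real X_train is numeric).
X_train, y_numeric = make_classification(n_samples=200, n_features=10, random_state=0)
y_train = np.where(y_numeric == 1, "spam", "ham")

classifier = LogisticRegression(max_iter=1000)

# Option 1: binarize the string labels. Classes are sorted, so 'ham' -> 0 and 'spam' -> 1,
# which makes 'spam' the positive class expected by scoring='precision' / 'recall'.
lb = LabelBinarizer()
y_train_bin = lb.fit_transform(y_train).ravel()

precision = cross_val_score(classifier, X_train, y_train_bin, cv=10, scoring="precision")
recall = cross_val_score(classifier, X_train, y_train_bin, cv=10, scoring="recall")

# Option 2: keep the string labels and state the positive class explicitly.
precision_spam = make_scorer(precision_score, pos_label="spam")
precision_alt = cross_val_score(classifier, X_train, y_train, cv=10, scoring=precision_spam)

print("mean precision:", precision.mean())
print("mean recall:", recall.mean())
print("mean precision (string labels):", precision_alt.mean())
```

Which option fits better probably depends on whether the rest of the pipeline needs the original 'spam'/'ham' strings.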

To go further, I was surprised that I didn't have to binarize the labels when I wanted to calculate the accuracy:

accuracy = cross_val_score(classifier, X_train, y_train, cv=10)

That's because the accuracy formula doesn't need to know which class is considered positive or negative: (TP + TN) / (TP + TN + FN + FP). Swapping the positive class swaps TP with TN and FP with FN, so the result is unchanged. That is not the case for recall, precision and f1.
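
The symmetry argument above can be checked with a small hand-made example. This is only an illustrative sketch in plain Python: the helper confusion_counts and the toy label lists are hypothetical and not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fn + fp)


def precision(tp, tn, fp, fn):
    return tp / (tp + fp)


# Toy labels: 3 spam and 5 ham messages, with one mistake in each direction.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

counts_spam = confusion_counts(y_true, y_pred, positive="spam")  # (2, 4, 1, 1)
counts_ham = confusion_counts(y_true, y_pred, positive="ham")    # (4, 2, 1, 1)

acc_spam, acc_ham = accuracy(*counts_spam), accuracy(*counts_ham)
prec_spam, prec_ham = precision(*counts_spam), precision(*counts_ham)

print("accuracy: ", acc_spam, acc_ham)    # same value for both choices
print("precision:", prec_spam, prec_ham)  # depends on the positive class
```

Here the accuracy is 0.75 whichever class is called positive, while the precision is 2/3 for 'spam' and 0.8 for 'ham', which is why the scorer needs to be told the positive label.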