
Scikit-learn script giving vastly different results than the tutorial, and gives an error when I change the dataframes

I'm working through a tutorial that has this section:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model.logistic import LogisticRegression
>>> from sklearn.cross_validation import train_test_split, cross_val_score
>>> df = pd.read_csv('data/sms.csv')
>>> X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
>>> vectorizer = TfidfVectorizer()
>>> X_train = vectorizer.fit_transform(X_train_raw)
>>> X_test = vectorizer.transform(X_test_raw)
>>> classifier = LogisticRegression()
>>> classifier.fit(X_train, y_train)
>>> precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
>>> print 'Precision', np.mean(precisions), precisions
>>> recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
>>> print 'Recalls', np.mean(recalls), recalls


Which I then copied with a few modifications:

ddir = (sys.argv[1])
df = pd.read_csv(ddir + '/SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE, names=["label", "message"])
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['label'], df['message'])


vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)


precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')


print 'Precision', np.mean(precisions), precisions
print 'Recalls', np.mean(recalls), recalls


However, despite there being next to no differences in the code, the results in the book are far better than mine:

Book:
Precision 0.992137651822 [ 0.98717949 0.98666667 1. 0.98684211 1. ]

Recall 0.677114261885 [ 0.7 0.67272727 0.6 0.68807339 0.72477064]


Mine:
Precision 0.108435683974 [ 2.33542342e-06 1.22271611e-03 1.68918919e-02 1.97530864e-01 3.26530612e-01]
Recalls 0.235220281632 [ 0.00152053 0.03370787 0.125 0.44444444 0.57142857]


Going back over the script to see what went wrong, I thought that line 18:

X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['label'], df['message'])


was the culprit, and changed (df['label'], df['message']) to (df['message'], df['label']). But that gave me an error:

Traceback (most recent call last):
File "Chapter4[B-FLGTLG]C[Y-BCPM][G-PAR--[00].py", line 30, in <module>
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
File "/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py", line 1433, in cross_val_score
for train, test in cv)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 800, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 658, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 566, in _dispatch
job = ImmediateComputeBatch(batch)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 180, in __init__
self.results = batch()
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py", line 1550, in _fit_and_score
test_score = _score(estimator, X_test, y_test, scorer)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py", line 1606, in _score
score = scorer(estimator, X_test, y_test)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/scorer.py", line 90, in __call__
**self._kwargs)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 1203, in precision_score
sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 984, in precision_recall_fscore_support
(pos_label, present_labels))
ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'],
dtype='|S4')


What could be the problem here? The data is here: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection in case anyone wants to try.

Answer

The error at the end of the stack trace is the key to understanding what's going on here.

ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4')

You're trying to score your model with precision and recall. Recall that these scoring methods are formulated in terms of true positives, false positives, and false negatives. But how does sklearn know what is positive and what is negative? Is it 'ham' or 'spam'? We need a way to tell sklearn that we consider 'spam' the positive label and 'ham' the negative label. According to the sklearn documentation, the precision and recall scorers by default expect a positive label of 1, hence the pos_label=1 part of the error message.
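As a quick illustration of that default (a minimal standalone sketch with made-up labels, not the SMS data):

from sklearn.metrics import precision_score, recall_score

y_true = ['ham', 'spam', 'spam', 'ham']
y_pred = ['ham', 'spam', 'ham', 'ham']

# With string labels the default pos_label=1 is not one of the classes, so
# precision_score(y_true, y_pred) raises the same ValueError as in your traceback.
# Telling the scorer which class counts as positive fixes it:
print precision_score(y_true, y_pred, pos_label='spam')  # 1.0 (the one predicted spam is a true spam)
print recall_score(y_true, y_pred, pos_label='spam')     # 0.5 (one of the two true spams was missed)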

There are at least three ways to fix this.

1. Encode 'ham' and 'spam' values as 0 and 1 directly from the data source in order to accommodate the precision/recall scorers:

# Map the label column to 0/1 and put the values into a numpy array
encoded_labels = df['label'].map(lambda x: 1 if x == 'spam' else 0).values # ham will be 0 and spam will be 1

# Continue as normal
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], encoded_labels)

2. Use sklearn's built-in label_binarize function to transform the categorical labels into encoded integers in order to accommodate the precision/recall scorers:

# Encode labels
from sklearn.preprocessing import label_binarize
encoded_column_vector = label_binarize(df['label'], classes=['ham','spam']) # ham will be 0 and spam will be 1
encoded_labels = np.ravel(encoded_column_vector) # Reshape array

# Continue as normal
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], encoded_labels)
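(For reference, the np.ravel call is there because label_binarize returns a column vector of shape (n_samples, 1) for a two-class problem, while sklearn expects a flat 1-D array of labels. A tiny check with made-up labels:)

import numpy as np
from sklearn.preprocessing import label_binarize

demo = label_binarize(['ham', 'spam', 'ham'], classes=['ham', 'spam'])
print demo.shape      # (3, 1) -- a column vector
print np.ravel(demo)  # [0 1 0] -- flat array suitable as the y argument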

3. Create scorer objects with custom arguments for pos_label:

As the documentation says, the precision and recall scorers default to pos_label=1, but this can be changed to tell the scorer which label represents the positive class. You can construct scorer objects with different arguments using make_scorer.

# Start out as you did originally with string labels
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
# Fit classifier as normal ...


# Get precision and recall
from sklearn.metrics import precision_score, recall_score, make_scorer
# Precision
precision_scorer = make_scorer(precision_score, pos_label='spam')
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring=precision_scorer)
print 'Precision', np.mean(precisions), precisions

# Recall
recall_scorer = make_scorer(recall_score, pos_label='spam')
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring=recall_scorer)
print 'Recalls', np.mean(recalls), recalls

After making any of these changes to your code, I'm getting average precision and recall scores of around 0.990 and 0.704, consistent with the book's numbers.

Of the three options, I recommend #3 because it is the hardest to get wrong.
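For completeness, here is a minimal end-to-end sketch of option #3 applied to your script (assuming the same tab-separated SMSSpamCollection file and the same older sklearn version with the cross_validation module; adjust the imports if you are on a newer release):

import sys, csv
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.metrics import precision_score, recall_score, make_scorer

# Load the tab-separated SMS data: first column is the label, second the message
ddir = sys.argv[1]
df = pd.read_csv(ddir + '/SMSSpamCollection', sep='\t',
                 quoting=csv.QUOTE_NONE, names=['label', 'message'])

# Features (messages) first, labels second
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Scorers that treat 'spam' as the positive class
precision_scorer = make_scorer(precision_score, pos_label='spam')
recall_scorer = make_scorer(recall_score, pos_label='spam')

precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring=precision_scorer)
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring=recall_scorer)

print 'Precision', np.mean(precisions), precisions
print 'Recalls', np.mean(recalls), recalls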
