S. Iqbal - 2 months ago

Python Question

So I am using a simple SGDClassifier on the MNIST dataset (as per the Hands-on ML book) and I can't seem to figure out the behavior of its decision_function.

To check whether anything was different, I changed the last line of the original decision_function so that it returns its inputs instead of the scores. The variables with the suffix "check" are the ones returned by the modified decision_function. The code:

    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata("MNIST original")

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.utils.extmath import safe_sparse_dot
    from sklearn.utils import check_array

    X, y = mnist["data"], mnist["target"]
    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

    shuffle_index = np.random.permutation(60000)
    X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

    # converting the problem into a binary classification problem.
    y_train_5 = (y_train == 5)
    y_test_5 = (y_test == 5)

    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X_train, y_train_5)

    # modified decision_function to output its variables
    X_check, coef_check, intcpt_check = sgd_clf.decision_function([X_train[36000]])
    print((X_check == X_train[36000]).all())
    print((coef_check == sgd_clf.coef_).all())
    print((intcpt_check == sgd_clf.intercept_).all())

    # using the same functions that decision_function uses for the calculation,
    # i.e. check_array and safe_sparse_dot
    X_mod = check_array(X[36000].reshape(1,-1), "csr")
    my_score = safe_sparse_dot(X_mod,sgd_clf.coef_.T) + sgd_clf.intercept_
    sk_score = safe_sparse_dot(X_check, coef_check.T) + intcpt_check
    print(my_score)
    print(sk_score)

Here is the output (for a single run):

    True
    True
    True
    [[ 49505.1725926]]
    [[-347904.18757136]]

Here is the modification I made to decision_function (the new return statement sits just above the original one, which is commented out):

    scores = safe_sparse_dot(X, self.coef_.T,
                             dense_output=True) + self.intercept_
    return X, self.coef_, self.intercept_
    #return scores.ravel() if scores.shape[1] == 1 else scores

Even though the three quantities involved (the instance X, the coefficients, and the intercept) all compare equal to my own variables, the multiplication still gives vastly different results.

Edit: Curiously, I found out that if I comment out the two lines responsible for shuffling the dataset, namely:

    shuffle_index = np.random.permutation(60000)
    X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

the problem disappears...

Answer Source

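It's probably because X_mod is built from the original, unshuffled X matrix, while X_check comes from the shuffled X_train.

That is, X_train[36000] != X[36000], so of course your scores should not be the same. This also explains why the problem disappears when the two shuffling lines are commented out: without the shuffle, X_train[36000] and X[36000] are the same row.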
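To see the effect without downloading MNIST, here is a minimal sketch on a toy array. Everything in it is a stand-in chosen for illustration: the 10x2 matrix, the fixed permutation (used instead of np.random.permutation so the result is reproducible), and the made-up coef and intercept values. It only shows that after shuffling, row i of X_train is no longer row i of X, so a linear score computed from each will generally differ.

```python
import numpy as np

# Toy stand-in for the MNIST feature matrix (hypothetical data).
X = np.arange(20).reshape(10, 2)
X_train = X[:8]

# Fixed permutation standing in for np.random.permutation(8),
# so that the outcome is reproducible.
shuffle_index = np.roll(np.arange(8), 1)
X_train = X_train[shuffle_index]

i = 3

# After shuffling, row i of X_train is row shuffle_index[i] of X, not row i.
same_row = np.array_equal(X_train[i], X[i])
maps_back = np.array_equal(X_train[i], X[shuffle_index[i]])

# Made-up linear model: score = x . coef + intercept
coef = np.array([1.0, -2.0])
intercept = 0.5
score_shuffled = X_train[i] @ coef + intercept
score_original = X[i] @ coef + intercept

print(same_row, maps_back)
print(score_shuffled, score_original)
```

In the question's code, replacing X[36000] with X_train[36000] when building X_mod should make my_score and sk_score agree, since both would then use the same row.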