caposeidon - 3 months ago
Python Question

How to adjust scaled scikit-learn Logistic Regression coeffs to score a non-scaled dataset?

I am currently using Scikit-Learn's LogisticRegression to build a model. I have used

from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(build)
build_scaled = scaler.transform(build)


to scale all of my input variables prior to training the model. Everything works fine and produces a decent model, but my understanding is that the coefficients produced by LogisticRegression.coef_ are based on the scaled variables. Is there a transformation of those coefficients that can be used to produce coefficients that apply to the non-scaled data?

I am thinking forward to an implementation of the model in a productionized system, and trying to determine whether all of the variables need to be pre-processed in some way in production for scoring.

Note: the model will likely have to be re-coded within the production environment and the environment is not using python.

Answer

Short answer: to get the LogisticRegression coefficients and intercept for unscaled data (assuming binary classification, and that lr is a trained LogisticRegression object):

  1. divide your coefficient array element-wise by the scaler.scale_ array (available since v0.17): coefficients = np.true_divide(lr.coef_, scaler.scale_)

  2. subtract from your intercept the inner product of the resulting coefficient array (the division result) with the scaler.mean_ array: intercept = lr.intercept_ - np.dot(coefficients, scaler.mean_)
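The two steps above can be sketched end to end as follows. This is a minimal, self-contained example using synthetic data in place of the asker's build dataset; the sanity check at the end confirms that the back-transformed coefficients applied to the raw data reproduce the scaled pipeline's decision function:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's `build` data (binary classification)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

lr = LogisticRegression().fit(X_scaled, y)

# Step 1: divide the coefficients element-wise by the per-feature scale
coefficients = np.true_divide(lr.coef_, scaler.scale_)

# Step 2: shift the intercept by the inner product with the feature means
intercept = lr.intercept_ - np.dot(coefficients, scaler.mean_)

# Sanity check: scoring the RAW data with the adjusted parameters matches
# the decision function of the model applied to the SCALED data
raw_scores = X @ coefficients.ravel() + intercept
scaled_scores = lr.decision_function(X_scaled)
assert np.allclose(raw_scores, scaled_scores)
```

The adjusted coefficients and intercept are plain numbers, so they can be ported to a non-Python production environment as a simple dot product plus a constant.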

You can see why the above works if you recall that each feature is normalized by subtracting its mean (stored in the scaler.mean_ array) and then dividing by its standard deviation (stored in the scaler.scale_ array).
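Written out, the substitution looks like this (w and b are the coefficients and intercept learned on the scaled data):

    z = (x - mean) / scale                       # StandardScaler transform
    w·z + b = w·(x - mean)/scale + b
            = (w/scale)·x + (b - (w/scale)·mean)

so w/scale becomes the coefficient on the raw feature x (step 1), and b - (w/scale)·mean becomes the new intercept (step 2).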
