caposeidon - 1 year ago
Python Question

How to adjust scaled scikit-learn Logistic Regression coefficients to score a non-scaled dataset?

I am currently using Scikit-Learn's LogisticRegression to build a model. I have used

from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(build)
build_scaled = scaler.transform(build)


to scale all of my input variables prior to training the model. Everything works fine and produces a decent model, but my understanding is that the coefficients reported by LogisticRegression.coef_ are based on the scaled variables. Is there a transformation of those coefficients that adjusts them so they can be applied to the non-scaled data?

I am thinking forward to an implementation of the model in a productionized system, and attempting to determine whether all of the variables need to be pre-processed in some way in production before scoring with the model.

Note: the model will likely have to be re-coded within the production environment, and the environment is not using Python.

Answer Source

Short answer: to get the LogisticRegression coefficients and intercept for unscaled data (assuming binary classification, with lr a trained LogisticRegression object and scaler the fitted StandardScaler):

  1. divide your coefficient array element-wise by the scaler.scale_ array (named scale_ since v0.17): coefficients = np.true_divide(lr.coef_, scaler.scale_)

  2. subtract from your intercept the inner product of the resulting coefficient array (the division result) with the scaler.mean_ array: intercept = lr.intercept_ - np.dot(coefficients, scaler.mean_)
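The two steps above can be sketched and checked end to end; the data here is synthetic and the variable names (X, y) are illustrative, not from the question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data standing in for the original dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

scaler = StandardScaler().fit(X)
lr = LogisticRegression().fit(scaler.transform(X), y)

# Step 1: divide each scaled coefficient by that feature's standard deviation
coefficients = np.true_divide(lr.coef_, scaler.scale_)

# Step 2: shift the intercept by the inner product with the feature means
intercept = lr.intercept_ - np.dot(coefficients, scaler.mean_)

# The adjusted parameters reproduce the model's decision scores on raw data
manual = X.dot(coefficients.ravel()) + intercept
assert np.allclose(manual, lr.decision_function(scaler.transform(X)))
```

The final assertion is the property you need for production: scoring raw, non-scaled rows with the adjusted coefficients gives the same result as scaling first and using lr.coef_.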

You can see why this works if you recall that every feature is standardized by subtracting its mean (stored in the scaler.mean_ array) and then dividing by its standard deviation (stored in the scaler.scale_ array).
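Expanding the scaled linear term makes both steps explicit (w and b are the coefficients and intercept learned on scaled data):

```latex
z_j = \frac{x_j - \mu_j}{\sigma_j}
\qquad\Longrightarrow\qquad
\sum_j w_j \frac{x_j - \mu_j}{\sigma_j} + b
= \sum_j \frac{w_j}{\sigma_j}\, x_j
  + \Bigl(b - \sum_j \frac{w_j}{\sigma_j}\,\mu_j\Bigr)
```

The first bracketed group is step 1 (coefficients divided by scale_), and the new constant term is step 2 (intercept minus the inner product with mean_).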