user6656013 - 2 months ago 19
Python Question

linear regression model prediction in scikit-learn is inconsistent

So I built a simple linear regression model with a handful of features. When I try to predict for new input, the output is inconsistent. For example:

In [1]: model.predict(X_new)
Out[1]: array([ 7.15993216e+08, 1.13548305e+09])


But if I tack it onto the original training sample, I get a very different answer:

In [2]: model.predict(X_training[:1].append(X_new))[1:]
Out[2]: array([ 272682.59925699, 1179906.89475647])


This seems to be model-agnostic (at least within linear regression). I also tried the same thing inside a pipeline and got the same behavior.

Any thoughts?

Answer

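This looks like a column-order problem in the pandas DataFrame rather than a problem with the model. Once a DataFrame is converted to a NumPy array, the model matches features by position, not by column name, so if X_new has its columns in a different order from X_training, each coefficient is applied to the wrong feature. Appending X_new to a slice of X_training aligns the columns by label, which would explain why the second call returns plausible values. One fix is to put the training and prediction data into the same column order before calling fit and predict. Something along the lines of: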

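import numpy as np
# y_training (the target values) is assumed here; the original snippet omitted it
model.fit(np.array(X_training.sort_index(axis=1)), y_training)
model.predict(np.array(X_new.sort_index(axis=1)))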

This cements the column order in the training and testing arrays.
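
As a rough illustration of the point above, here is a small self-contained sketch on synthetic data. The column names "area" and "rooms", the coefficients, and the feature_order variable are all made up for the example and are not from the question. Instead of sorting, it records the training column order and reindexes the new input to match, which has the same effect:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free training data (hypothetical columns "area" and "rooms").
rng = np.random.default_rng(0)
X_training = pd.DataFrame({
    "area": rng.uniform(50, 200, size=100),
    "rooms": rng.integers(1, 6, size=100).astype(float),
})
y_training = 3000.0 * X_training["area"] + 10000.0 * X_training["rooms"]

# Record the column order used for fitting, then fit on a plain array.
feature_order = list(X_training.columns)
model = LinearRegression().fit(X_training[feature_order].to_numpy(), y_training)

# New input with the same columns in a different order.
X_new = pd.DataFrame({"rooms": [3.0, 4.0], "area": [100.0, 150.0]})

# Wrong: positional matching applies the "area" coefficient to "rooms".
wrong = model.predict(X_new.to_numpy())

# Right: restore the training column order before predicting.
right = model.predict(X_new[feature_order].to_numpy())

# Values the true (noise-free) relationship gives for the new rows.
expected = np.array([3000.0 * 100 + 10000.0 * 3, 3000.0 * 150 + 10000.0 * 4])
print("misordered columns:", wrong)
print("reordered columns: ", right)
print("expected:          ", expected)
```

Selecting columns with X_new[feature_order] also raises a KeyError if a training column is missing from the new data, which is arguably a safer failure than a silently wrong prediction.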
