user6656013 - 9 months ago
Python Question

linear regression model prediction in scikit-learn is inconsistent

So I built a simple linear regression model with a handful of features. When I try to predict for new input, the output is inconsistent. For example:

In [1]: model.predict(X_new)
Out[1]: array([ 7.15993216e+08, 1.13548305e+09])

But if I tack it onto the original training sample, I get a very different answer:

In [2]: model.predict(X_training[:1].append(X_new))[1:]
Out[2]: array([ 272682.59925699, 1179906.89475647])

This seems to be model-agnostic (at least within linear regression). I also tried the same thing inside a pipeline and got the same behavior.

Any thoughts?


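This looks like an issue with the column order of the pandas data frames. X_new probably has its columns in a different order than X_training. predict matches features by position, whereas append aligns the two frames by column name, so the tacked-on rows get put back into the training order and produce sensible predictions. A solution is to put the columns of both the training and testing data sets in the same order before fitting and predicting, for example by sorting both by column name.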
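The original snippet is not shown here, so the following is only a minimal sketch of the idea, not necessarily the exact code that was posted. The column names (sqft, rooms) and the data are made up for illustration; sort_index(axis=1) is one way to give both frames the same column order.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data; column names and values are illustrative only.
X_training = pd.DataFrame({
    "sqft": [1000.0, 1500.0, 2000.0, 2500.0],
    "rooms": [2.0, 3.0, 3.0, 5.0],
})
y_training = 100.0 * X_training["sqft"] + 5000.0 * X_training["rooms"]

# New input whose columns arrive in a different order than the training data.
X_new = pd.DataFrame({"rooms": [4.0], "sqft": [1800.0]})

# Sort the columns of both frames by name so they share one order.
X_training = X_training.sort_index(axis=1)
X_new = X_new.sort_index(axis=1)

model = LinearRegression().fit(X_training, y_training)
prediction = model.predict(X_new)
print(prediction)
```

If sorting by name is not wanted, selecting the new frame's columns in the training order with X_new[X_training.columns] should have the same effect.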

This cements the column order in the training and testing arrays.