user6656013 - 7 months ago 62

Python Question

So I built a simple linear regression model with a handful of features. When I try to predict for new input, the output is inconsistent. For example:

`In [1]: model.predict(X_new)`

`Out[1]: array([ 7.15993216e+08, 1.13548305e+09])`

But if I tack it onto the original training sample, I get a very different answer:

`In [2]: model.predict(X_training[:1].append(X_new))[1:]`

`Out[2]: array([ 272682.59925699, 1179906.89475647])`

This seems to be model-agnostic (at least within linear regression). I also tried the same inside a pipeline and got the same behavior.

Any thoughts?

Answer

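This looks like an issue with the column order of the pandas data frames. The model matches features by position, so if the columns of the new input are ordered differently from the training data, each coefficient is applied to the wrong feature. A solution is to put both the training and testing data sets in the same column order before fitting and predicting. Something along the lines of: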

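```
import numpy as np
# Sort columns by name in both frames; y_training is an assumed name for the targets.
model.fit(np.array(X_training.sort_index(axis=1)), y_training)
model.predict(np.array(new_input.sort_index(axis=1)))
```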

This cements the column order in the training and testing arrays, provided both data frames have the same column names.
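
To see why the order matters, here is a minimal sketch on made-up data. It assumes the model only ever sees column positions, and it uses `np.linalg.lstsq` as a stand-in for the asker's regression model; the names `predict`, `wrong`, `right`, and `appended` are hypothetical. It also suggests why the appended frame in the question gave sensible values: row-wise concatenation in pandas aligns columns by name.

```python
import numpy as np
import pandas as pd

# Toy training data where y = 1*a + 1000*b exactly, so the fit is easy to check.
X_training = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 1.0, 4.0, 3.0]})
y_training = 1.0 * X_training["a"] + 1000.0 * X_training["b"]

# Stand-in for the model: ordinary least squares on the raw array.
# It only sees column positions, never column names.
coef, *_ = np.linalg.lstsq(X_training.to_numpy(), y_training.to_numpy(), rcond=None)

def predict(frame):
    """Apply the fitted coefficients to the frame's columns by position."""
    return frame.to_numpy() @ coef

# New input with the same columns, but in a different order.
X_new = pd.DataFrame({"b": [5.0], "a": [6.0]})

# Predicting directly applies the coefficient of "a" to the values of "b".
wrong = predict(X_new)

# Selecting columns by the training order fixes it.
# This raises KeyError if X_new lacks a training column.
right = predict(X_new[X_training.columns])

# Concatenating onto a training row aligns columns by name,
# which mirrors the second call in the question.
appended = predict(pd.concat([X_training[:1], X_new]).iloc[1:])
```

Selecting with `X_new[X_training.columns]` is an alternative to sorting both frames: it reuses whatever order the training frame already has and fails loudly when a column is missing.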