user3294779 - 8 months ago 46

Python Question

I have the following OLS model from StatsModels:

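    import statsmodels.api as sm
    import statsmodels.tools.tools  # makes the full add_constant path below resolvable

    # df is assumed to be a pandas DataFrame with 'Grade' and 'Results' columns
    X = df['Grade']
    y = df['Results']
    X = statsmodels.tools.tools.add_constant(X)  # prepend a column of ones (the intercept)
    mod = sm.OLS(y, X)
    results = mod.fit()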

When trying to predict a new Y value for an X value of 4, I have to pass the following:

`results.predict([1,4])`

I don't understand why an array with the first value being '1' needs to be passed in order for the predict function to work correctly. Why do I need to include a 1 instead of just saying:

`results.predict([4])`

I'm not clear on the concept at work here. Does anybody know what's going on?

Answer

You are adding a constant to the regression equation with `X = statsmodels.tools.tools.add_constant(X)`. So your regressor `X` has two columns, and the first column is an array of ones.

You need to do the same with the regressor that is used in prediction. For `[1, 4]` the prediction is `1 * params[0] + 4 * params[1]`, so the `1` means: include the constant in the prediction. If you use zero instead, then the contribution of the constant (`0 * params[0]`) is zero and the prediction is only the slope effect.

The formula interface adds the constant automatically both for the regressor in the model and for the regressor in the prediction. However, with the pandas DataFrame or numpy ndarray interface, the constant needs to be added by the user both for the model and for predict.
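For comparison, a small sketch of the formula interface, again on made-up data that is not from the question. Here only the `Grade` value is passed to `predict`, and the intercept is handled for you.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data lying exactly on Results = 2 + 3 * Grade (illustration only).
df = pd.DataFrame({"Grade": [1.0, 2.0, 3.0, 4.0, 5.0]})
df["Results"] = 2.0 + 3.0 * df["Grade"]

# "Results ~ Grade" includes an intercept term automatically.
formula_results = smf.ols("Results ~ Grade", data=df).fit()

# Only the Grade column is supplied; no column of ones is needed.
prediction = formula_results.predict(pd.DataFrame({"Grade": [4.0]}))

print(formula_results.params)
print(prediction)
```

The new data has to be something with named columns (a DataFrame or a dict), because the formula looks up `Grade` by name.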

Source (Stack Overflow)