gary yong - 2 months ago
Python Question

I get what looks like a linear regression when using SVR from Python's scikit-learn, even though my data is not linear

import matplotlib.pyplot as plt
from sklearn.svm import SVR

train.sort_values(by=['mass'], ascending=True, inplace=True)
x = train['mass']
y = train['pa']

# Fit regression models with three kernels
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_lin = SVR(kernel='linear', C=1e3)
svr_poly = SVR(kernel='poly', C=1e3, degree=2)

# scikit-learn expects a 2-D feature array, so reshape the Series
x = x.values.reshape(-1, 1)
y_rbf = svr_rbf.fit(x, y).predict(x)
y_lin = svr_lin.fit(x, y).predict(x)
y_poly = svr_poly.fit(x, y).predict(x)

# Look at the results
plt.scatter(x, y, c='k', label='data')
plt.plot(x, y_rbf, c='g', label='RBF model')
plt.plot(x, y_lin, c='r', label='Linear model')
plt.plot(x, y_poly, c='b', label='Polynomial model')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()


The code is copied from http://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html.
The only thing I changed is the dataset, and I do not know what the problem is.

Answer

This most likely has to do with the scale of your data. You are using the same penalty hyper-parameter C as the example, but your y values are orders of magnitude larger. The SVR objective will therefore favor simplicity (a flat fit) over accuracy, since the penalty for errors is now small compared to your y values. You need to increase C to, say, 1e6, or normalize your y values.
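Here is a minimal sketch of both fixes, reusing the x and y from the question (the exact C value and the manual standardization are illustrative assumptions, not tuned values):

import numpy as np
from sklearn.svm import SVR

# Option 1: raise C so that fit errors on large y values still dominate the objective
svr_big_c = SVR(kernel='rbf', C=1e6, gamma=0.1)
y_rbf_big_c = svr_big_c.fit(x, y).predict(x)

# Option 2: standardize y, fit on the scaled targets, then map predictions back
y = np.asarray(y, dtype=float)
y_mean, y_std = y.mean(), y.std()
svr = SVR(kernel='rbf', C=1e3, gamma=0.1)
y_rbf = svr.fit(x, (y - y_mean) / y_std).predict(x) * y_std + y_mean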

You can see that this is the case by making C very small in the example code, say C=1e-5. You then get the same kind of flat results that you are getting with your data.
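As a rough check, here is a sketch that shrinks C on the example's own synthetic data (the data-generating lines below are from the linked example; C=1e-5 is the value suggested above):

import numpy as np
from sklearn.svm import SVR

X = np.sort(5 * np.random.rand(40, 1), axis=0)  # toy data from the example
y = np.sin(X).ravel()

# With a near-zero penalty the RBF fit collapses to an almost flat line,
# mirroring what a relatively small C does when y spans large values.
y_tiny_c = SVR(kernel='rbf', C=1e-5, gamma=0.1).fit(X, y).predict(X)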


As a side note, a huge part of machine-learning practice is hyper-parameter tuning. This is a good example of how even a good base model can yield bad results if given the wrong hyper-parameters.
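For instance, a simple cross-validated grid search over C and gamma (a sketch; the grid values are illustrative, and x must already be the 2-D array from the question):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {'C': [1e0, 1e2, 1e4, 1e6],
              'gamma': [1e-3, 1e-2, 1e-1, 1e0]}
search = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=5)
search.fit(x, y)
print(search.best_params_)  # use these values to refit the final model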
