omrsin - 3 months ago
Scala Question

Linear regression with Spark MLlib only returns monotonic predictions

Check the update at the bottom of the question

Summary: I have a dataset that does not behave linearly. I am trying to use Spark's MLlib (v1.5.2) to fit a model that behaves more like a polynomial function, but I always get a linear model as a result. I don't know whether it is possible to obtain a non-linear model using linear regression.

[TL;DR] I am trying to fit a model that represents the following data sufficiently well:

[image: plot of the data]

My code is very simple (pretty much like in every tutorial):

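import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object LinearRegressionTest {

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "Linear Regression")
    val data = sc.textFile("data2.csv")
    // Column 1 holds the label, column 2 the single input feature.
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(1).toDouble, Vectors.dense(parts(2).toDouble))
    }.cache()

    val numIterations = 1000
    val stepSize = 0.001

    val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
    sc.stop()
  }
}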


The results obtained are in the right range; however, they always lie on a monotonically increasing line. I am trying to wrap my head around it, but I cannot figure out why a better curve is not being fitted.

Any tips?

Thanks everyone

Update
The problem was caused by the version of the spark and spark-ml libraries that we were using. For some reason, version 1.5.2 did not fit a better curve even though I provided more features (squared or cubic versions of the input data). The fix was to upgrade to version 2.0.0 and switch from the deprecated LinearRegressionWithSGD to LinearRegression from the main API (not the RDD API). With this new method the model fitted the right curve.
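
For readers who want to try the same switch, here is a minimal sketch of what it might look like with the Spark 2.0 DataFrame API. It assumes the same data2.csv layout as in the code above (label in column 1, input in column 2). The object name, the helper fitQuadratic, the column names label, x and x2, and the maxIter value are illustrative choices, not taken from the original project.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LinearRegressionMlSketch {

  // Hypothetical helper: expects numeric columns "label" and "x",
  // adds the squared input and fits label ~ x + x^2.
  def fitQuadratic(df: DataFrame): LinearRegressionModel = {
    val withSquare = df.withColumn("x2", col("x") * col("x"))
    val assembled = new VectorAssembler()
      .setInputCols(Array("x", "x2"))
      .setOutputCol("features")
      .transform(withSquare)
    // This API fits an intercept by default; it is set explicitly for clarity.
    new LinearRegression()
      .setFitIntercept(true)
      .setMaxIter(100) // illustrative value, not tuned
      .fit(assembled)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("Linear Regression")
      .getOrCreate()
    try {
      // Headerless CSV columns are named _c0, _c1, ... and are read as strings.
      val df = spark.read.csv("data2.csv")
        .select(col("_c1").cast("double").as("label"),
                col("_c2").cast("double").as("x"))
      val model = fitQuadratic(df)
      println(s"intercept = ${model.intercept}, coefficients = ${model.coefficients}")
    } finally {
      spark.stop()
    }
  }
}
```

A cubic term could be added the same way, with one more withColumn call and one more entry in setInputCols.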

Answer

There is nothing unexpected here. You are using a linear model of the form

Y = βx + ε

so the fitted result will always be a line going through the origin (unlike R, for example, Spark does not fit an intercept by default), and as long as the model is at least marginally sane, it should be increasing to approximate the distribution of the data.
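
To see why the slope has to be positive, it may help to write down the ordinary least-squares estimate for the no-intercept model. This is a sketch under the assumption that SGD ends up close to the closed-form minimizer, which is not guaranteed for every step size:

```latex
\hat{\beta}
  = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta x_i)^2
  = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}
```

For data where all x_i ≥ 0 and all y_i > 0 (as in the rough approximation used further down), both sums are positive, so the estimated β is positive and the prediction βx can only increase with x.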

While the details are probably off-topic on StackOverflow, you should start by adding more features. It should be obvious that a decent approximation here has to be quadratic, so let's illustrate that step by step. We'll start with a very rough approximation of your data:

y <- c(0.6, 0.6, 0.6, 0.6, 0.575, 0.55, 0.525, 0.475, 0.45, 0.40, 0.35, 0.30)
df <- data.frame(y=c(y, rev(y)), x=0:23)
plot(df$x, df$y)

[image: scatter plot of y against x for the approximated data]

The model created in Spark is more or less equivalent to:

lm1 <- lm(y ~ x + 0, df)
lines(df$x, predict(lm1, df), col='red')

[image: the same plot with the fit through the origin drawn in red]

Since it is clear that a model passing through the origin is not a good fit, let's try to add an intercept:

lm2 <- lm(y ~ x, df)
lines(df$x, predict(lm2, df), col='blue')

[image: the same plot with the intercept fit added in blue]

Finally, we know we need some non-linearity:

df$x2 <- df$x ** 2
lm3 <- lm(y ~ x + x2, df)
lines(df$x, predict(lm3, df), col='green')

[image: the same plot with the quadratic fit added in green]

The takeaway message here is:

  • use setIntercept(true) on the LinearRegressionWithSGD algorithm object before training (it is not a method of the resulting LinearRegressionModel),
  • add some non-linear features to the model.

    val x = parts(2).toDouble
    LabeledPoint(parts(1).toDouble, Vectors.dense(x, x*x))
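
Putting both points together, a minimal sketch for the RDD-based API in Spark 1.5.x might look like the following. The object name and the parseLine helper are illustrative. The iteration count and step size are simply copied from the question and are not tuned.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object QuadraticRegressionSketch {

  // Hypothetical helper: turns one CSV line into a labeled point with
  // features (x, x^2). Column positions follow the question's code.
  def parseLine(line: String): LabeledPoint = {
    val parts = line.split(',')
    val x = parts(2).toDouble
    LabeledPoint(parts(1).toDouble, Vectors.dense(x, x * x))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "Quadratic Regression Sketch")
    try {
      val parsedData = sc.textFile("data2.csv").map(parseLine).cache()

      // The static train(...) helper has no intercept option,
      // so the algorithm object is configured directly.
      val algorithm = new LinearRegressionWithSGD()
      algorithm.setIntercept(true)
      algorithm.optimizer
        .setNumIterations(1000) // copied from the question, not tuned
        .setStepSize(0.001)     // copied from the question, not tuned

      val model = algorithm.run(parsedData)
      println(s"intercept = ${model.intercept}, weights = ${model.weights}")
    } finally {
      sc.stop()
    }
  }
}
```

One caveat: SGD is sensitive to feature scale, and x*x grows much faster than x, so these settings may well fail to converge. Standardizing the features first (for example with StandardScaler from org.apache.spark.mllib.feature) or adjusting the step size is likely to be necessary.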
    