omrsin - 5 months ago 51

Scala Question

**Check the update at the bottom of the question**

Summary: I have a dataset that does not behave linearly. I am trying to use Spark's MLlib (v1.5.2) to fit a model that behaves more like a polynomial function, but I always get a linear model as a result. I don't know whether it is even possible to obtain a non-linear model using linear regression.

[TL;DR] I am trying to fit a model that represents the following data sufficiently well:

My code is very simple (pretty much like in every tutorial):

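```
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object LinearRegressionTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "Linear Regression")

    // Each CSV line: column 1 is the label, column 2 is the single input feature
    val data = sc.textFile("data2.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(1).toDouble, Vectors.dense(parts(2).toDouble))
    }.cache()

    val numIterations = 1000
    val stepSize = 0.001
    val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

    sc.stop()
  }
}
```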

The results obtained are in the right range; however, they always lie on a monotonically increasing line. I am trying to wrap my head around it, but I cannot figure out why a better curve is not being fitted.

Any tips?

Thanks everyone

Update: The problem was caused by the version of the Spark and spark-ml libraries that we were using. For some reason, version 1.5.2 was not fitting a better curve even though I provided more features (squared or cubic versions of the input data). The problem went away after upgrading to version 2.0.0 and switching from the deprecated `LinearRegressionWithSGD` to `LinearRegression`.
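For readers on Spark 2.0.0, the switch to `LinearRegression` could look roughly like the sketch below. It is an illustration under assumptions, not the code actually used: the object name `MlQuadraticSketch`, the helper names, and the column names `label`, `x`, `x2` and `features` are made up here, and the CSV is assumed to have no header and the same column layout as in the question.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object MlQuadraticSketch {
  // Adds a squared column and assembles (x, x^2) into a "features" vector column.
  def withQuadraticFeatures(df: DataFrame): DataFrame = {
    val squared = df.withColumn("x2", col("x") * col("x"))
    new VectorAssembler()
      .setInputCols(Array("x", "x2"))
      .setOutputCol("features")
      .transform(squared)
  }

  // Fits label ~ intercept + b1*x + b2*x^2; spark.ml fits an intercept by default.
  def fitQuadratic(df: DataFrame): LinearRegressionModel =
    new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .fit(withQuadraticFeatures(df))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("Quadratic fit sketch")
      .getOrCreate()

    // Assumed layout, mirroring the question: column 1 is the label, column 2 the input.
    val df = spark.read.csv("data2.csv")
      .select(col("_c1").cast("double").as("label"), col("_c2").cast("double").as("x"))

    val model = fitQuadratic(df)
    println(s"intercept = ${model.intercept}, coefficients = ${model.coefficients}")
    spark.stop()
  }
}
```

Cubic or higher-order terms could be added the same way, by creating further columns and listing them in `setInputCols`.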

Answer

There is nothing unexpected here. You are using a linear model of the form

```
Y = βx + ε
```

so the fitted result will always be a line going through the origin (unlike, for example, R, Spark doesn't fit an intercept by default), and as long as the model is at least marginally sane it should be increasing to approximate the distribution of the data.

While the details are probably off-topic on StackOverflow, you should start by adding more features. It should be obvious that a decent approximation here has to be quadratic, so let's illustrate that step by step. We'll start with a very rough approximation of your data (in R):

```
y <- c(0.6, 0.6, 0.6, 0.6, 0.575, 0.55, 0.525, 0.475, 0.45, 0.40, 0.35, 0.30)
df <- data.frame(y=c(y, rev(y)), x=0:23)
plot(df$x, df$y)
```

The model created in Spark is more or less equivalent to:

```
lm1 <- lm(y ~ x + 0, df)
lines(df$x, predict(lm1, df), col='red')
```

Since it is clear that a model passing through the origin is not a good fit, let's try adding an intercept:

```
lm2 <- lm(y ~ x, df)
lines(df$x, predict(lm2, df), col='blue')
```

Finally, we know we need some non-linearity:

```
df$x2 <- df$x ** 2
lm3 <- lm(y ~ x + x2, df)
lines(df$x, predict(lm3, df), col='green')
```
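For readers who prefer Scala to R, the three fits above can be reproduced without Spark using plain least squares. This is only a minimal sketch: the object name `PolynomialFitSketch` and its helpers are invented for illustration, and it solves the normal equations directly, which is fine for a toy problem of this size but is not what Spark does internally.

```scala
object PolynomialFitSketch {
  // Ordinary least squares: solves (X^T X) b = X^T y by Gauss-Jordan elimination
  // with partial pivoting. Each row of `rows` is one observation's feature vector.
  def leastSquares(rows: Seq[Array[Double]], ys: Seq[Double]): Array[Double] = {
    val k = rows.head.length
    // Augmented matrix [X^T X | X^T y]
    val a = Array.tabulate(k, k + 1) { (i, j) =>
      if (j < k) rows.map(r => r(i) * r(j)).sum
      else rows.zip(ys).map { case (r, y) => r(i) * y }.sum
    }
    for (col <- 0 until k) {
      val pivot = (col until k).maxBy(r => math.abs(a(r)(col)))
      val tmp = a(col); a(col) = a(pivot); a(pivot) = tmp
      for (r <- 0 until k if r != col) {
        val f = a(r)(col) / a(col)(col)
        for (c <- col to k) a(r)(c) -= f * a(col)(c)
      }
    }
    Array.tabulate(k)(i => a(i)(k) / a(i)(i))
  }

  def predict(row: Array[Double], beta: Array[Double]): Double =
    row.zip(beta).map { case (v, b) => v * b }.sum

  // Residual sum of squares of a fitted model
  def rss(rows: Seq[Array[Double]], ys: Seq[Double], beta: Array[Double]): Double =
    rows.zip(ys).map { case (r, y) => math.pow(y - predict(r, beta), 2) }.sum

  // Same rough approximation of the data as in the R snippet
  val half = Seq(0.6, 0.6, 0.6, 0.6, 0.575, 0.55, 0.525, 0.475, 0.45, 0.40, 0.35, 0.30)
  val ys: Seq[Double] = half ++ half.reverse
  val xs: Seq[Double] = (0 until 24).map(_.toDouble)

  val originOnly: Seq[Array[Double]] = xs.map(x => Array(x))               // y ~ x + 0
  val withIntercept: Seq[Array[Double]] = xs.map(x => Array(1.0, x))       // y ~ x
  val quadratic: Seq[Array[Double]] = xs.map(x => Array(1.0, x, x * x))    // y ~ x + x2

  def main(args: Array[String]): Unit = {
    val models = Seq("y ~ x + 0" -> originOnly, "y ~ x" -> withIntercept, "y ~ x + x2" -> quadratic)
    for ((name, rows) <- models) {
      val beta = leastSquares(rows, ys)
      val coefficients = beta.mkString(", ")
      println(s"$name: coefficients = [$coefficients], RSS = ${rss(rows, ys, beta)}")
    }
  }
}
```

Because this toy data is symmetric around its midpoint, the model with an intercept ends up with a slope of about zero, so only the quadratic term lets the fit follow the dip in the middle.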

The take-away message here is:

- use `setIntercept(true)` on the algorithm that creates the `LinearRegressionModel`,
- add some non-linear features to the model:

```
val x = parts(2).toDouble
LabeledPoint(parts(1).toDouble, Vectors.dense(x, x * x))
```
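Put together with the code from the question, both suggestions might look like the following sketch for the MLlib 1.x API. The object name `QuadraticSgdSketch` and the helper `parseQuadratic` are hypothetical. The iteration count and step size are simply copied from the question and are an assumption, not tuned values: since `x * x` is on a much larger scale than `x`, SGD may well need a different step size or scaled features to converge.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object QuadraticSgdSketch {
  // Parses one CSV line into a point with features (x, x^2);
  // column layout as in the question (column 1 = label, column 2 = input).
  def parseQuadratic(line: String): LabeledPoint = {
    val parts = line.split(',')
    val x = parts(2).toDouble
    LabeledPoint(parts(1).toDouble, Vectors.dense(x, x * x))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "Quadratic Regression Sketch")
    val parsedData = sc.textFile("data2.csv").map(parseQuadratic).cache()

    // Configure the algorithm explicitly instead of using the static train(...) helper,
    // so that an intercept can be requested.
    val algorithm = new LinearRegressionWithSGD()
    algorithm.setIntercept(true)
    algorithm.optimizer
      .setNumIterations(1000) // value taken from the question, may need tuning
      .setStepSize(0.001)     // value taken from the question, may need tuning

    val model = algorithm.run(parsedData)
    println(s"intercept = ${model.intercept}, weights = ${model.weights}")

    sc.stop()
  }
}
```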