Jacek Dominiak - 10 months ago 65

Scala Question

I am getting started with the Spark MLlib library in Scala. In my tests so far, the results are not even remotely correct. I have tried several approaches without success. Even with relatively simple data:

```
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9
10,10
```

I cannot get any decent results. Here is my code so far (fairly standard, I guess):

```
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/Users/jacek/oo.csv")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(Array(1.0, parts(1).toDouble)))
}

val numIterations = 20
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
```

These are the results I am getting:

```
model: org.apache.spark.mllib.regression.LinearRegressionModel = (weights=[-1.3423470408513295E21,-9.345181656001024E21], intercept=0.0)

scala> parsedData.take(10)
res48: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((1.0,[1.0,1.0]), (2.0,[1.0,2.0]), (3.0,[1.0,3.0]), (4.0,[1.0,4.0]), (5.0,[1.0,5.0]), (6.0,[1.0,6.0]), (7.0,[1.0,7.0]), (8.0,[1.0,8.0]), (9.0,[1.0,9.0]), (10.0,[1.0,10.0]))

scala> valuesAndPreds.take(10)
res49: Array[(Double, Double)] = Array((1.0,-6.133210764535208E21), (2.0,-1.2266421529070415E22), (3.0,-1.8399632293605623E22), (4.0,-2.453284305814083E22), (5.0,-3.0666053822676038E22), (6.0,-3.6799264587211245E22), (7.0,-4.293247535174645E22), (8.0,-4.906568611628166E22), (9.0,-5.519889688081687E22), (10.0,-6.1332107645352076E22))
```

I've tried different LinearRegression algorithm settings without much luck.

Any help appreciated.

Answer Source

Based on some tests, these are the regression optimiser settings that make the numbers about as good as they can get, I suppose:

```
val regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
regression.optimizer.setNumIterations(1000)
val model2 = regression.run(parsedData)
```

Thanks @pzecevic for your help. You've pointed me in the right direction.
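
For anyone wondering why the step size matters so much here, below is a rough sketch in plain Scala (no Spark) that runs full-batch gradient descent on the same ten points. It is only an illustration under a few assumptions: that the step decays as `stepSize / sqrt(iteration)`, which is how I understand MLlib's SGD updater to behave, and that the `train(parsedData, numIterations)` call in the question ends up with a step size of 1.0. The names `StepSizeSketch` and `fit` are made up for this example and are not part of MLlib.

```scala
// Plain batch gradient descent on the ten points from the question (y = x).
// Assumption: the step shrinks as stepSize / sqrt(iteration); this is a
// simplified stand-in for illustration, not MLlib's actual code.
object StepSizeSketch {
  val xs: Array[Double] = (1 to 10).map(_.toDouble).toArray
  val ys: Array[Double] = xs.clone()

  /** Returns (weight, intercept) after numIterations full-batch updates. */
  def fit(stepSize: Double, numIterations: Int): (Double, Double) = {
    var w = 0.0
    var b = 0.0
    val n = xs.length
    for (i <- 1 to numIterations) {
      // Residuals of the current model on every point.
      val errs = xs.zip(ys).map { case (x, y) => w * x + b - y }
      // Gradient of half the mean squared error with respect to w and b.
      val gradW = errs.zip(xs).map { case (e, x) => e * x }.sum / n
      val gradB = errs.sum / n
      // Decaying step for this iteration.
      val step = stepSize / math.sqrt(i.toDouble)
      w -= step * gradW
      b -= step * gradB
    }
    (w, b)
  }

  def main(args: Array[String]): Unit = {
    println(s"step 1.0, 20 iterations:   ${fit(1.0, 20)}")
    println(s"step 0.1, 1000 iterations: ${fit(0.1, 1000)}")
  }
}
```

With a step of 1.0 every update overshoots, because the features are unscaled and the average of x squared is 38.5, so the weights blow up instead of settling. With 0.1 the first few updates still overshoot, but the decaying step soon becomes small enough and the fit ends up close to a slope of 1 with a small intercept. Scaling the features would probably be another way to tame this, but I have not tried it here.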