Jacek Dominiak - 1 month ago
Scala Question

Linear regression weights and prediction in spark

I am getting started with Spark's MLlib library in Scala. In my tests so far I cannot get results that are even remotely correct. I have tried several approaches without success. Even with relatively simple data like this:

1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9
10,10


I cannot get any decent results. Here is my code so far (fairly standard, I guess):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/Users/jacek/oo.csv")
// Each CSV line is "label,feature"; a constant 1.0 is prepended to the feature vector.
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(Array(1.0, parts(1).toDouble)))
}

val numIterations = 20
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}


Those are the results I am getting:

model: org.apache.spark.mllib.regression.LinearRegressionModel = (weights=[-1.3423470408513295E21,-9.345181656001024E21], intercept=0.0)

scala> parsedData.take(10)
res48: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((1.0,[1.0,1.0]), (2.0,[1.0,2.0]), (3.0,[1.0,3.0]), (4.0,[1.0,4.0]), (5.0,[1.0,5.0]), (6.0,[1.0,6.0]), (7.0,[1.0,7.0]), (8.0,[1.0,8.0]), (9.0,[1.0,9.0]), (10.0,[1.0,10.0]))

scala> valuesAndPreds.take(10)
res49: Array[(Double, Double)] = Array((1.0,-6.133210764535208E21), (2.0,-1.2266421529070415E22), (3.0,-1.8399632293605623E22), (4.0,-2.453284305814083E22), (5.0,-3.0666053822676038E22), (6.0,-3.6799264587211245E22), (7.0,-4.293247535174645E22), (8.0,-4.906568611628166E22), (9.0,-5.519889688081687E22), (10.0,-6.1332107645352076E22))



I've tried several different LinearRegression algorithm settings without much luck.
Any help appreciated.

Answer

Based on some tests, here are the regression optimiser settings that seem to make the numbers as good as they can get:

// Fit an intercept term and use a smaller step size with more iterations.
val regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
regression.optimizer.setNumIterations(1000)
val model2 = regression.run(parsedData)
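
A likely reason the original run blew up is the step size: LinearRegressionWithSGD.train(parsedData, numIterations) uses a step size that, as far as I can tell, defaults to 1.0, which is too large for unscaled features ranging up to 10. The following is a minimal plain-Scala sketch of that effect, not MLlib's actual implementation. The names StepSizeSketch and fit are made up for illustration, and the decaying step (stepSize / sqrt(iteration)) is an assumption meant to resemble what the SGD optimizer does.

```scala
object StepSizeSketch {
  // Same toy data as in the question: y = x for x = 1..10.
  val xs: Seq[Double] = (1 to 10).map(_.toDouble)
  val ys: Seq[Double] = xs

  /** Batch gradient descent for y ≈ w * x + b on mean squared error.
    * The step shrinks as stepSize / sqrt(iteration); this schedule is an
    * assumption of the sketch, not something taken from the question.
    * Returns the fitted (w, b).
    */
  def fit(stepSize: Double, numIterations: Int): (Double, Double) = {
    var w = 0.0
    var b = 0.0
    val n = xs.size.toDouble
    for (i <- 1 to numIterations) {
      // Residuals of the current model on every data point.
      val errors = xs.zip(ys).map { case (x, y) => w * x + b - y }
      // Gradient of 0.5 * mean(error^2) with respect to w and b.
      val gradW = errors.zip(xs).map { case (e, x) => e * x }.sum / n
      val gradB = errors.sum / n
      val step = stepSize / math.sqrt(i.toDouble)
      w -= step * gradW
      b -= step * gradB
    }
    (w, b)
  }

  def main(args: Array[String]): Unit = {
    val (wBad, bBad) = fit(stepSize = 1.0, numIterations = 20)
    val (wGood, bGood) = fit(stepSize = 0.1, numIterations = 1000)
    println(s"step 1.0, 20 iterations:   w = $wBad, b = $bBad")
    println(s"step 0.1, 1000 iterations: w = $wGood, b = $bGood")
  }
}
```

Running main should show the first pair blowing up to very large magnitudes, while the second ends up close to w = 1, b = 0 without being exact. That matches the observation that these settings give good but not perfect numbers.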

Thanks @pzecevic for your help. You pointed me in the right direction.
