Vijay_Shinde - 5 days ago
Java Question

Calculate Linear regression on data set in Map Reduce

Say I have an input as follows:

60,3.1

61,3.6

62,3.8

63,4

65,4.1


The expected output is:

y = -8.098 + 0.19x

I know how to do this in Java, but I don't know how it works with the MapReduce model. Can anyone give an idea or sample MapReduce code for this problem? I would appreciate it.

Here is a simple mathematical example:

Regression Formula:
Regression Equation(y) = a + bx
Slope(b) = (NΣXY - (ΣX)(ΣY)) / (NΣX^2 - (ΣX)^2)
Intercept(a) = (ΣY - b(ΣX)) / N

where
x and y are the variables.
b = The slope of the regression line
a = The intercept point of the regression line and the y axis.
N = Number of values or elements
X = First Score
Y = Second Score
ΣXY = Sum of the products of the First and Second Scores
ΣX = Sum of the First Scores
ΣY = Sum of the Second Scores
ΣX^2 = Sum of the squares of the First Scores
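
The formulas above can be sketched in plain Java as follows. This is only an illustration: the class name `RegressionFromSums` and the method `fit` are hypothetical, and the sums passed in `main` are those of the five data points from the question.

```java
// Simple linear regression y = a + b*x computed from precomputed sums.
// RegressionFromSums and fit are illustrative names, not from any library.
final class RegressionFromSums {

    /** Returns {intercept a, slope b} for the given N, ΣX, ΣY, ΣXY, ΣX^2. */
    static double[] fit(long n, double sumX, double sumY, double sumXY, double sumXX) {
        double denominator = n * sumXX - sumX * sumX;
        if (denominator == 0.0) {
            // Happens when every x value is the same (or n == 0).
            throw new IllegalArgumentException("slope is undefined for these sums");
        }
        double slope = (n * sumXY - sumX * sumY) / denominator;
        double intercept = (sumY - slope * sumX) / n;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        // Sums for the points (60,3.1), (61,3.6), (62,3.8), (63,4), (65,4.1).
        double[] ab = fit(5, 311, 18.6, 1159.7, 19359);
        System.out.printf("y = %.4f + %.4fx%n", ab[0], ab[1]);
    }
}
```

Note that the code keeps the slope unrounded (about 0.1878), so the intercept comes out at about -7.96. The value -8.098 in the expected output is obtained only when the slope is first rounded to 0.19.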


e.g.

X Values   Y Values
60         3.1
61         3.6
62         3.8
63         4
65         4.1


To find the regression equation, we first find the slope and the intercept, then use them to form the equation.

Step 1: Count the number of values.
N = 5

Step 2: Find XY, X^2
See the table below

X Value   Y Value   X*Y                   X*X
60        3.1       60 * 3.1 = 186        60 * 60 = 3600
61        3.6       61 * 3.6 = 219.6      61 * 61 = 3721
62        3.8       62 * 3.8 = 235.6      62 * 62 = 3844
63        4         63 * 4 = 252          63 * 63 = 3969
65        4.1       65 * 4.1 = 266.5      65 * 65 = 4225



Step 3: Find ΣX, ΣY, ΣXY, ΣX^2.
ΣX = 311
ΣY = 18.6
ΣXY = 1159.7
ΣX^2 = 19359

Step 4: Substitute into the slope formula given above.
Slope(b) = (NΣXY - (ΣX)(ΣY)) / (NΣX^2 - (ΣX)^2)
= ((5)*(1159.7)-(311)*(18.6))/((5)*(19359)-(311)^2)
= (5798.5 - 5784.6)/(96795 - 96721)
= 13.9/74
= 0.19 (rounded from 0.1878)

Step 5: Now substitute into the intercept formula given above, using the rounded slope.
Intercept(a) = (ΣY - b(ΣX)) / N
= (18.6 - 0.19(311))/5
= (18.6 - 59.09)/5
= -40.49/5
= -8.098

Step 6: Then substitute these values into the regression equation formula
Regression Equation(y) = a + bx
= -8.098 + 0.19x.


Suppose we want to know the approximate y value for x = 64. We can substitute that value into the above equation.

Regression Equation(y) = a + bx
= -8.098 + 0.19(64)
= -8.098 + 12.16
= 4.06
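
The whole hand calculation (Steps 1 to 6 plus the prediction) might look like this in plain Java. It is a minimal sketch: `WorkedRegressionExample`, `fit` and `predict` are hypothetical names, and the code assumes at least two distinct x values so that the denominator is non-zero.

```java
// One pass over the data accumulates N, ΣX, ΣY, ΣXY and ΣX^2, then the
// slope and intercept formulas are applied. All names are illustrative.
final class WorkedRegressionExample {

    /** Returns {intercept a, slope b}; assumes at least two distinct x values. */
    static double[] fit(double[][] points) {
        int n = points.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (double[] p : points) {
            double x = p[0], y = p[1];
            sumX += x;
            sumY += y;
            sumXY += x * y;
            sumXX += x * x;
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;
        return new double[] {intercept, slope};
    }

    /** Evaluates y = a + b*x for coefficients {a, b}. */
    static double predict(double[] coefficients, double x) {
        return coefficients[0] + coefficients[1] * x;
    }

    public static void main(String[] args) {
        double[][] data = {{60, 3.1}, {61, 3.6}, {62, 3.8}, {63, 4.0}, {65, 4.1}};
        double[] unrounded = fit(data);
        double[] rounded = {-8.098, 0.19}; // coefficients from the hand calculation
        System.out.println("unrounded coefficients: y(64) = " + predict(unrounded, 64));
        System.out.println("rounded coefficients:   y(64) = " + predict(rounded, 64));
    }
}
```

Both sets of coefficients predict a value close to 4.06 at x = 64, even though the intercepts differ (about -7.96 without rounding the slope, -8.098 with it).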

Answer

I coded a MapReduce implementation of linear regression with mrjob here:

https://github.com/AmazaspShumik/MapReduce-Machine-Learning/blob/master/Linear%20Regression%20MapReduce/LinearRegressionTS.py

The two most important components for obtaining the parameters of linear regression are X.t * X (where X is the input matrix, with each row an observation) and X.t * Y. Each mapper calculates X_i.t * X_i and X_i.t * Y_i for the i-th part of the data that it processes, then all mappers send their output to a single reducer. The reducer sums the results of all mappers:

X.t * X = Σi (X_i.t * X_i)
X.t * Y = Σi (X_i.t * Y_i)

Then the parameter vector is b = (X.t * X)^-1 * X.t * Y. However, a better way to solve for the parameters is to use Cholesky decomposition, as is done in the code.
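
For the one-variable case in the question, each row of X is (1, x), so X.t * X holds N, ΣX and ΣX^2, and X.t * Y holds ΣY and ΣXY. The sketch below simulates the two phases in plain Java to show how the partial sums combine. It does not use the Hadoop API and is not the linked mrjob code; `Stats`, `mapSplit` and `reduce` are hypothetical names, and the 2x2 system is solved in closed form rather than by Cholesky decomposition.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the map and reduce phases (no Hadoop involved).
final class RegressionMapReduceSketch {

    /** Partial sums for one split: the entries of X_i.t * X_i and X_i.t * Y_i. */
    static final class Stats {
        long n;
        double sumX, sumY, sumXY, sumXX;

        void add(double x, double y) {
            n++;
            sumX += x;
            sumY += y;
            sumXY += x * y;
            sumXX += x * x;
        }

        void merge(Stats other) {
            n += other.n;
            sumX += other.sumX;
            sumY += other.sumY;
            sumXY += other.sumXY;
            sumXX += other.sumXX;
        }
    }

    /** "Map" side: parse the "x,y" lines of one split into partial sums. */
    static Stats mapSplit(List<String> lines) {
        Stats stats = new Stats();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines in the input
            }
            String[] fields = line.split(",");
            stats.add(Double.parseDouble(fields[0].trim()),
                      Double.parseDouble(fields[1].trim()));
        }
        return stats;
    }

    /** "Reduce" side: add up all partial sums, then solve for {intercept, slope}. */
    static double[] reduce(List<Stats> partials) {
        Stats total = new Stats();
        for (Stats partial : partials) {
            total.merge(partial);
        }
        double det = total.n * total.sumXX - total.sumX * total.sumX;
        if (det == 0.0) {
            throw new IllegalStateException("need at least two distinct x values");
        }
        double slope = (total.n * total.sumXY - total.sumX * total.sumY) / det;
        double intercept = (total.sumY - slope * total.sumX) / total.n;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        // Two splits, as if handled by two different mappers.
        List<String> split1 = Arrays.asList("60,3.1", "61,3.6", "62,3.8");
        List<String> split2 = Arrays.asList("63,4", "65,4.1");
        double[] ab = reduce(Arrays.asList(mapSplit(split1), mapSplit(split2)));
        System.out.printf("y = %.4f + %.4fx%n", ab[0], ab[1]);
    }
}
```

Because the sums are additive, the result does not depend on how the input is split. In a real job the mapper would emit the five partial sums under a single key so that one reducer receives all of them.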
