Vijay_Shinde - 1 year ago 128

Java Question

Say I have an input as follows:

60,3.1

61,3.6

62,3.8

63,4

65,4.1

Output is expected as follows:

Expected output: y = -8.098 + 0.19x.

I know how to do this in Java, but I don't know how it works with the MapReduce model. Can anyone give me an idea or sample MapReduce code for this problem? I would appreciate it.

Here is a simple mathematical example:

Regression Formula:

Regression Equation(y) = a + bx

Slope(b) = (NΣXY - (ΣX)(ΣY)) / (NΣX² - (ΣX)²)

Intercept(a) = (ΣY - b(ΣX)) / N

where

x and y are the variables.

b = The slope of the regression line

a = The intercept point of the regression line and the y axis.

N = Number of values or elements

X = First Score

Y = Second Score

ΣXY = Sum of the product of first and Second Scores

ΣX = Sum of First Scores

ΣY = Sum of Second Scores

ΣX² = Sum of the squares of the First Scores

e.g.

| X Values | Y Values |
|---|---|
| 60 | 3.1 |
| 61 | 3.6 |
| 62 | 3.8 |
| 63 | 4 |
| 65 | 4.1 |

To find the regression equation, we first find the slope and the intercept, and then use them to form the equation.

Step 1: Count the number of values.

N = 5

Step 2: Find XY and X² for each pair of values.

See the table below:

| X Value | Y Value | X*Y | X*X |
|---|---|---|---|
| 60 | 3.1 | 60 * 3.1 = 186 | 60 * 60 = 3600 |
| 61 | 3.6 | 61 * 3.6 = 219.6 | 61 * 61 = 3721 |
| 62 | 3.8 | 62 * 3.8 = 235.6 | 62 * 62 = 3844 |
| 63 | 4 | 63 * 4 = 252 | 63 * 63 = 3969 |
| 65 | 4.1 | 65 * 4.1 = 266.5 | 65 * 65 = 4225 |

Step 3: Find ΣX, ΣY, ΣXY, ΣX².

ΣX = 311

ΣY = 18.6

ΣXY = 1159.7

ΣX² = 19359

Step 4: Substitute into the slope formula given above.

Slope(b) = (NΣXY - (ΣX)(ΣY)) / (NΣX² - (ΣX)²)

= ((5)*(1159.7) - (311)*(18.6)) / ((5)*(19359) - (311)²)

= (5798.5 - 5784.6) / (96795 - 96721)

= 13.9 / 74

≈ 0.19 (rounded to two decimal places)

Step 5: Substitute into the intercept formula given above, using the rounded slope b = 0.19.

Intercept(a) = (ΣY - b(ΣX)) / N

= (18.6 - 0.19(311))/5

= (18.6 - 59.09)/5

= -40.49/5

= -8.098

Step 6: Substitute these values into the regression equation formula:

Regression Equation(y) = a + bx

= -8.098 + 0.19x.
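The six steps above can be sketched in a few lines of plain Python. This is only an illustration of the formulas, not MapReduce code, and the function name `fit_line` is made up for this example. Note that the code keeps the slope unrounded, so it yields roughly b = 0.188 and a = -7.964, whereas the hand calculation rounds b to 0.19 before computing the intercept and therefore arrives at -8.098.

```python
def fit_line(points):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(points)
    sum_x = sum(x for x, _ in points)        # ΣX
    sum_y = sum(y for _, y in points)        # ΣY
    sum_xy = sum(x * y for x, y in points)   # ΣXY
    sum_xx = sum(x * x for x, _ in points)   # ΣX²

    # Slope(b) = (NΣXY - (ΣX)(ΣY)) / (NΣX² - (ΣX)²)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # Intercept(a) = (ΣY - b(ΣX)) / N
    a = (sum_y - b * sum_x) / n
    return a, b


points = [(60, 3.1), (61, 3.6), (62, 3.8), (63, 4.0), (65, 4.1)]
a, b = fit_line(points)
print(f"y = {a:.3f} + {b:.3f}x")
```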

Suppose we want to know the approximate y value for x = 64. We can substitute that value into the equation above.

Regression Equation(y) = a + bx

= -8.098 + 0.19(64).

= -8.098 + 12.16

= 4.06

Answer Source

I coded a MapReduce implementation of linear regression with mrjob here:

The two most important components for obtaining the parameters of a linear regression are X.t * X (where X is the input matrix in which each row is an observation) and X.t * Y. Each mapper calculates X_i.t * X_i and X_i.t * Y_i for the i-th part of the data that it processes, and then all mappers send their output to a single reducer. The reducer sums the results of all mappers:

X.t * X = Σi (X_i.t * X_i)

X.t * Y = Σi (X_i.t * Y_i)
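For the one-variable case in the question, each row of X can be taken as [1, x], so X.t * X holds N, ΣX and ΣX², and X.t * Y holds ΣY and ΣXY. The sketch below simulates the map and reduce phases in plain Python under that assumption. It is not the mrjob code referred to above: the names `mapper` and `reducer` and the two-way split of the input are invented purely for illustration.

```python
from functools import reduce


def mapper(chunk):
    """Compute the partial sums for one chunk of 'x,y' lines.

    Returns (n, Σx, Σx², Σy, Σxy), i.e. the entries of
    X_i.t * X_i and X_i.t * Y_i when each row of X_i is [1, x].
    """
    n = sx = sxx = sy = sxy = 0.0
    for line in chunk:
        x_str, y_str = line.split(",")
        x, y = float(x_str), float(y_str)
        n += 1
        sx += x
        sxx += x * x
        sy += y
        sxy += x * y
    return (n, sx, sxx, sy, sxy)


def reducer(left, right):
    """Add two tuples of partial sums element by element."""
    return tuple(l + r for l, r in zip(left, right))


lines = ["60,3.1", "61,3.6", "62,3.8", "63,4", "65,4.1"]
chunks = [lines[:2], lines[2:]]  # pretend the input is split across two mappers

n, sx, sxx, sy, sxy = reduce(reducer, map(mapper, chunks))

xtx = [[n, sx], [sx, sxx]]  # X.t * X
xty = [sy, sxy]             # X.t * Y

# With a single predictor, the closed-form slope and intercept follow directly.
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = (sy - b * sx) / n
print(xtx, xty, a, b)
```

Because the partial sums are simply added together, the result does not depend on how the input lines are divided among the mappers.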

Then the parameter vector is b = (X.t * X)^-1 * X.t * Y. However, a better way to estimate the parameters is to use a Cholesky decomposition, as is done in the code.
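As a rough illustration of the Cholesky approach (again, not the linked mrjob code), the sketch below factors X.t * X = L * L.t and then solves two triangular systems instead of forming the inverse. The helper names `cholesky` and `solve_cholesky` are hypothetical, and the matrices are the sums from the worked example in the question, assuming each row of X is [1, x].

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L.t == a (a must be symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve_cholesky(a, rhs):
    """Solve a * beta = rhs using the Cholesky factor of a."""
    l = cholesky(a)
    n = len(rhs)
    # Forward substitution: L * z = rhs
    z = [0.0] * n
    for i in range(n):
        z[i] = (rhs[i] - sum(l[i][k] * z[k] for k in range(i))) / l[i][i]
    # Back substitution: L.t * beta = z
    beta = [0.0] * n
    for i in reversed(range(n)):
        beta[i] = (z[i] - sum(l[k][i] * beta[k] for k in range(i + 1, n))) / l[i][i]
    return beta


xtx = [[5.0, 311.0], [311.0, 19359.0]]  # [[N, ΣX], [ΣX, ΣX²]]
xty = [18.6, 1159.7]                    # [ΣY, ΣXY]

intercept, slope = solve_cholesky(xtx, xty)
print(intercept, slope)
```

The result should agree with the closed-form slope and intercept formulas up to floating-point error. In practice a library routine would normally be used for the factorization rather than hand-written loops.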