
Why does MLlib's generateLinearInput internally multiply the variance by 12.0?

Consider the generateLinearInput method from MLlib's LinearDataGenerator.

Here is the signature of the method:

def generateLinearInput(
    intercept: Double,
    weights: Array[Double],
    xMean: Array[Double],
    xVariance: Array[Double],
    nPoints: Int,
    seed: Int,
    eps: Double): Seq[LabeledPoint] = {


and here is the core logic for generating the raw data points:

val rnd = new Random(seed)
val x = Array.fill[Array[Double]](nPoints)(
  Array.fill[Double](weights.length)(rnd.nextDouble()))

x.foreach { v =>
  var i = 0
  val len = v.length
  while (i < len) {
    v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
    i += 1
  }
}


Notice in particular the 12.0 scaling factor applied to the variance. What is the purpose of that factor?

For completeness, here is the remainder of the method, in which the input linear function is applied to the x (domain) values to generate the y (range) values:

val y = x.map { xi =>
  blas.ddot(weights.length, xi, 1, weights, 1) + intercept + eps * rnd.nextGaussian()
}
y.zip(x).map(p => LabeledPoint(p._1, Vectors.dense(p._2)))
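For reference, the method can be exercised directly. Here is a minimal sketch, assuming the overload shown above is reachable via org.apache.spark.mllib.util.LinearDataGenerator; the parameter values are arbitrary illustrations:

import org.apache.spark.mllib.util.LinearDataGenerator

// Generate 1000 points around y = 0.5 + 2.0*x0 - 1.5*x1 with Gaussian noise of scale eps = 0.1.
// xMean and xVariance control the distribution of each feature column.
val points = LinearDataGenerator.generateLinearInput(
  intercept = 0.5,
  weights = Array(2.0, -1.5),
  xMean = Array(1.0, -2.0),
  xVariance = Array(4.0, 0.25),
  nPoints = 1000,
  seed = 42,
  eps = 0.1)

println(points.head.label)     // generated y value of the first point
println(points.head.features)  // corresponding feature vector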

Answer

If you have a random variable X uniformly distributed on an interval [a, b],

X ~ U(a, b)

then its variance is equal to

Var(X) = (b - a)^2 / 12

So this piece of code

v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)

should be equivalent to:

X' = (X - 1/2) * (b' - a') + EX'

where a' and b' are the parameters of the desired uniform distribution U(a', b') and EX' is the mean of the desired distribution. Since rnd.nextDouble() draws X from U(0, 1), which has variance 1/12, the multiplier math.sqrt(12.0 * xVariance(i)) is exactly b' - a', the scaling needed to reach the requested variance. If you set xMean to 0, the rest of the code simply centers the input data around 0 and adjusts its spread.
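To check this numerically, here is a small self-contained sketch (no Spark required; the target mean and variance are arbitrary) that applies the same transformation to U(0, 1) samples and verifies that the sample mean and variance land near the requested xMean and xVariance:

import scala.util.Random

object UniformScalingCheck {
  def main(args: Array[String]): Unit = {
    val rnd = new Random(17)
    val targetMean = 3.0      // plays the role of xMean(i)
    val targetVariance = 2.0  // plays the role of xVariance(i)
    val n = 1000000

    // Same transformation as in generateLinearInput: center U(0, 1) at 0,
    // rescale so the variance (originally 1/12) becomes targetVariance,
    // then shift by the target mean.
    val samples = Array.fill(n)(
      (rnd.nextDouble() - 0.5) * math.sqrt(12.0 * targetVariance) + targetMean)

    val mean = samples.sum / n
    val variance = samples.map(s => (s - mean) * (s - mean)).sum / n

    println(f"sample mean     = $mean%.4f (target $targetMean%.1f)")
    println(f"sample variance = $variance%.4f (target $targetVariance%.1f)")
  }
}

With a million samples the printed values should be close to the targets, which is consistent with the 12.0 factor undoing the 1/12 variance of the raw uniform draws.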
