Carlos Andres Castro - 6 months ago

Python Question

I was doing a normality test in Python with Spark MLlib and saw what I *think* is a bug.

Here is the setup: I have a dataset that is normalized (range -1 to 1).

When I plot a histogram, I can clearly see that the data is NOT normal:

`>>> prices_norm.histogram(10)`

([-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0],

[226, 269, 119, 95, 52, 26, 8, 2, 2, 5])

When I run the Kolmogorov-Smirnov test, I get the following results:

`>>> testResults = Statistics.kolmogorovSmirnovTest(prices_norm, "norm")`

`>>> print testResults`

Kolmogorov-Smirnov test summary:

degrees of freedom = 0

statistic = 0.46231145770077375

pValue = 1.742039845709087E-11

Very strong presumption against null hypothesis: Sample follows theoretical distribution.

The Kolmogorov-Smirnov test defines the null hypothesis as: the sample follows the theoretical distribution.

In this case the p-value is very low, so we should reject the null hypothesis. This makes sense, as the data is clearly not normal.

So why then, does it say:

`Sample follows theoretical distribution`

Isn't this wrong? Shouldn't it say that the sample does NOT follow a theoretical distribution? Am I missing something?

Answer

This was driving me crazy, so I went to look at the source code directly:

```
git://git.apache.org/spark.git
spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test/KolmogorovSmirnovTest.scala
```

The code is **correct**. The null hypothesis is defined as:

```
object NullHypothesis extends Enumeration {
  type NullHypothesis = Value
  val OneSampleTwoSided = Value("Sample follows theoretical distribution")
}
```

The wording of the message simply **restates the null hypothesis** after the colon:

```
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
                                                 ________________________________________
                                                                    H0
```

Arguably the wording is confusing, since it can be read both ways, but it is indeed correct: the part before the colon is the verdict, and the part after it is the hypothesis being rejected.
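
To make this concrete, here is a minimal Python sketch of how a summary line like this can be assembled. It is not Spark's code: `explain_p_value`, `rejects_null`, and the cutoffs 0.01 / 0.05 / 0.1 are assumptions chosen for illustration. The point is only that the prefix carries the verdict, while the text after the colon is always the same fixed null hypothesis string.

```python
# Illustrative only: the function names and thresholds are assumptions,
# not taken from the Spark source.
NULL_HYPOTHESIS = "Sample follows theoretical distribution"


def explain_p_value(p_value, null_hypothesis=NULL_HYPOTHESIS):
    """Build a summary line: '<strength> presumption against null hypothesis: <H0>.'"""
    if p_value <= 0.01:
        strength = "Very strong"
    elif p_value <= 0.05:
        strength = "Strong"
    elif p_value <= 0.1:
        strength = "Low"
    else:
        strength = "No"
    # The text after the colon never changes; only the strength prefix does.
    return "{} presumption against null hypothesis: {}.".format(
        strength, null_hypothesis
    )


def rejects_null(p_value, alpha=0.05):
    """Return True when the p-value is below the chosen significance level."""
    return p_value < alpha


if __name__ == "__main__":
    p = 1.742039845709087e-11  # the pValue reported in the question
    print(explain_p_value(p))
    print("Reject H0:", rejects_null(p))
```

If the message is ambiguous to you, it is safer to branch on the numeric p-value (as `rejects_null` does) than to parse the summary string.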