Hooloovoo - 2 months ago 53

Python Question

I have a data set on N numbers that I want to test for normality.

I know scipy.stats has a kstest function

but there are no examples on how to use it and how to interpret the results.

Is anyone here familiar with it that can give me some advice?

According to the documentation, using kstest returns two numbers, the KS test statistic D and the p-value.

If the p-value is greater than the significance level (say 5%), then we cannot reject the hypothesis that the data come from the given distribution.

When I do a test run by drawing 10000 samples from a normal distribution and testing for gaussianity:

`import numpy as np`

from scipy.stats import kstest

mu,sigma = 0.07, 0.89

kstest(np.random.normal(mu,sigma,10000),'norm')

I get the following output:

(0.04957880905196102, 8.9249710700788814e-22)

The p-value is less than 5% which means that we can reject the hypothesis that the data are normally distributed. But the samples were drawn from a normal distribution!

Can someone understand and explain to me the discrepancy here?

(Does testing for normality assume mu = 0 and sigma = 1? If so, how can I test that my data are gaussianly distributed but with a different mu and sigma?)

Answer

Your data was generated with mu=0.07 and sigma=0.89. You are testing this data against a normal distribution with mean 0 and standard deviation of 1.

The null hypothesis (`H0`

) is that the distribution of which your data is a sample is equal to the standard normal distribution with mean 0, std deviation 1.

The small p-value is indicating that a test statistic as large as D would be expected with probability p-value.

In other words, (with p-value ~8.9e-22) it is highly unlikely that `H0`

is true.

That is reasonable, since the means and std deviations don't match.

Compare your result with:

```
In [23]: import scipy.stats as stats
In [24]: stats.kstest(np.random.normal(0,1,10000),'norm')
Out[24]: (0.007038739782416259, 0.70477679457831155)
```

To test your data is gaussian, you could shift and rescale it so it is normal with mean 0 and std deviation 1:

```
data=np.random.normal(mu,sigma,10000)
normed_data=(data-mu)/sigma
print(stats.kstest(normed_data,'norm'))
# (0.0085805670733036798, 0.45316245879609179)
```

**Warning:** (many thanks to user333700!!) If you don't know `mu`

and `sigma`

, estimating the parameters makes the p-value invalid:

```
import numpy as np
import scipy.stats as stats
mu = 0.3
sigma = 5
num_tests = 10**5
num_rejects = 0
alpha = 0.05
for i in xrange(num_tests):
data = np.random.normal(mu, sigma, 10000)
# normed_data = (data - mu) / sigma # this is okay
# 4915/100000 = 0.05 rejects at rejection level 0.05 (as expected)
normed_data = (data - data.mean()) / data.std() # this is NOT okay
# 20/100000 = 0.00 rejects at rejection level 0.05 (not expected)
D, pval = stats.kstest(normed_data, 'norm')
if pval < alpha:
num_rejects += 1
ratio = float(num_rejects) / num_tests
print('{}/{} = {:.2f} rejects at rejection level {}'.format(
num_rejects, num_tests, ratio, alpha))
```

prints

```
20/100000 = 0.00 rejects at rejection level 0.05 (not expected)
```

which shows that `stats.kstest`

may not reject the expected number of null hypotheses
if the sample is normalized using the sample's mean and standard deviation

```
normed_data = (data - data.mean()) / data.std() # this is NOT okay
```