shantanuo - 3 months ago 27

Python Question

As you can see from the following summary, the count for 1 Sep (1542677) is way below the average count per month.

`from StringIO import StringIO`

myst="""01/01/2016 8781262

01/02/2016 8958598

01/03/2016 8787628

01/04/2016 9770861

01/05/2016 8409410

01/06/2016 8924784

01/07/2016 8597500

01/08/2016 6436862

01/09/2016 1542677

"""

u_cols=['month', 'count']

myf = StringIO(myst)

import pandas as pd

df = pd.read_csv(StringIO(myst), sep='\t', names = u_cols)

Is there a mathematical formula that can define this "way below or too high" (ambiguous) concept?

This is easy if I define a limit (for e.g. 9 or 10%). But I want the script to decide that for me and return the values if the difference between the lowest and second last lowest value is more than overall 5%. In this case the September month count should be returned.

Answer

A very common approach to filtering outliers is to use standard deviation. In this case, we will calculate a `zscore`

which will quickly identify how many standard deviations away from the mean each observation is. We can then filter those observations that are greater than 2 standard deviations. For normally distributed random variables, this should happen approximately 5% of the time.

Define a zscore function

```
def zscore(s):
return (s - np.mean(s)) / np.std(s)
```

Apply it to the `count`

column

```
zscore(df['count'])
0 0.414005
1 0.488906
2 0.416694
3 0.831981
4 0.256946
5 0.474624
6 0.336390
7 -0.576197
8 -2.643349
Name: count, dtype: float64
```

Notice that the September observation is 2.6 standard deviations away.

Use `abs`

and `gt`

to identify outliers

```
zscore(df['count']).abs().gt(2)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 True
Name: count, dtype: bool
```

Again, September comes back true.

Tie it all together to filter your original dataframe

```
df[zscore(df['count']).abs().gt(2)]
```

filter the other way

```
df[zscore(df['count']).abs().le(2)]
```