shantanuo - 1 year ago 96
Python Question

# How to find values below (or above) average

As you can see from the following summary, the count for 1 Sep (1542677) is way below the average count per month.

``````from StringIO import StringIO

myst="""01/01/2016  8781262
01/02/2016  8958598
01/03/2016  8787628
01/04/2016  9770861
01/05/2016  8409410
01/06/2016  8924784
01/07/2016  8597500
01/08/2016  6436862
01/09/2016  1542677
"""
u_cols=['month', 'count']

myf = StringIO(myst)
import pandas as pd
df = pd.read_csv(StringIO(myst), sep='\t', names = u_cols)
``````

Is there a mathematical formula that can define this "way below or too high" (ambiguous) concept?

This is easy if I define a limit (for e.g. 9 or 10%). But I want the script to decide that for me and return the values if the difference between the lowest and second last lowest value is more than overall 5%. In this case the September month count should be returned.

A very common approach to filtering outliers is to use standard deviation. In this case, we will calculate a `zscore` which will quickly identify how many standard deviations away from the mean each observation is. We can then filter those observations that are greater than 2 standard deviations. For normally distributed random variables, this should happen approximately 5% of the time.

Define a zscore function

``````def zscore(s):
return (s - np.mean(s)) / np.std(s)
``````

Apply it to the `count` column

``````zscore(df['count'])

0    0.414005
1    0.488906
2    0.416694
3    0.831981
4    0.256946
5    0.474624
6    0.336390
7   -0.576197
8   -2.643349
Name: count, dtype: float64
``````

Notice that the September observation is 2.6 standard deviations away.

Use `abs` and `gt` to identify outliers

``````zscore(df['count']).abs().gt(2)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8     True
Name: count, dtype: bool
``````

Again, September comes back true.

Tie it all together to filter your original dataframe

``````df[zscore(df['count']).abs().gt(2)]
``````

filter the other way

``````df[zscore(df['count']).abs().le(2)]
``````

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download