ibarant ibarant - 7 months ago 311
Python Question

pandas: t-test and p-value of month over month mean difference in aggregated dataframe using groupby function

This is my first posted question, so please excuse if it doesn't look good.

I have a source data file which I transform to the following dataframe using pandas groupby aggregation

pd.read_csv('R:/Python ETL/AGG7.csv', sep=',')

Treatment Month stdev n avg
0 AAAA 1/1/2016 1.92 309 7.57
1 AAAA 2/1/2016 1.89 79 7.46
2 AAAA 3/1/2016 2.25 158 7.20
3 AAAA 4/1/2016 2.23 22 7.68
4 BBBB 1/1/2016 2.04 175 7.10
5 BBBB 2/1/2016 1.96 33 7.09
6 BBBB 3/1/2016 2.02 110 7.32
7 BBBB 4/1/2016 1.73 25 7.92
8 CCCC 1/1/2016 2.42 111 7.40
9 CCCC 2/1/2016 1.45 22 7.73
10 CCCC 3/1/2016 2.44 21 6.95
11 CCCC 4/1/2016 2.84 92 6.92


What I need is 2 additional columns with month over month difference (MoM diff) and p-value of T-tests of those differences.

MoM diff pValue
-0.11 0.35
-0.26 0.62
0.48 0.65

-0.01 0.02
0.23 0.44
0.6 0.83

0.33 0.46
-0.78 0.79
-0.03 0.04


The problem is that I cannot get them on the fly using pandas group by with scipy.stats ttest_ind function from original dataset and ttest_ind_from_stats function from the shown aggregated dataframe. I tried many different approaches, but with no success. Can anyone help, please?

Answer

You can use df.shift with groupby to have the shifted values:

df[["avg_2", "n_2", "stdev_2"]] = df.groupby("Treatment")["avg", "n", "stdev"].shift()

df
Out[7]: 
   Treatment      Month  stdev    n   avg  avg_2    n_2  stdev_2
0       AAAA 2016-01-01   1.92  309  7.57    NaN    NaN      NaN
1       AAAA 2016-01-02   1.89   79  7.46   7.57  309.0     1.92
2       AAAA 2016-01-03   2.25  158  7.20   7.46   79.0     1.89
3       AAAA 2016-01-04   2.23   22  7.68   7.20  158.0     2.25
4       BBBB 2016-01-01   2.04  175  7.10    NaN    NaN      NaN
5       BBBB 2016-01-02   1.96   33  7.09   7.10  175.0     2.04
6       BBBB 2016-01-03   2.02  110  7.32   7.09   33.0     1.96
7       BBBB 2016-01-04   1.73   25  7.92   7.32  110.0     2.02
8       CCCC 2016-01-01   2.42  111  7.40    NaN    NaN      NaN
9       CCCC 2016-01-02   1.45   22  7.73   7.40  111.0     2.42
10      CCCC 2016-01-03   2.44   21  6.95   7.73   22.0     1.45
11      CCCC 2016-01-04   2.84   92  6.92   6.95   21.0     2.44

You can filter out NaN values with pd.notnull:

df2 = df[pd.notnull(df.avg_2)].copy()

And you can get the results of the t-tests with:

import scipy.stats as ss
res = ss.ttest_ind_from_stats(df2.avg, df2.stdev, df2.n, df2.avg_2, df2.stdev_2, df2.n_2, equal_var=False)

If you want the mean differences and p-values in this dataframe:

df2["dif_avg"] = df2.avg - df2.avg_2
df2["p_value"] = res.pvalue


Out[22]: 
        Month  stdev    n   avg  avg_2    n_2  stdev_2  dif_avg   p_value
1  2016-01-02   1.89   79  7.46   7.57  309.0     1.92    -0.11  0.646226
2  2016-01-03   2.25  158  7.20   7.46   79.0     1.89    -0.26  0.350814
3  2016-01-04   2.23   22  7.68   7.20  158.0     2.25     0.48  0.353023
5  2016-01-02   1.96   33  7.09   7.10  175.0     2.04    -0.01  0.978808
6  2016-01-03   2.02  110  7.32   7.09   33.0     1.96     0.23  0.559625
7  2016-01-04   1.73   25  7.92   7.32  110.0     2.02     0.60  0.137527
9  2016-01-02   1.45   22  7.73   7.40  111.0     2.42     0.33  0.395806
10 2016-01-03   2.44   21  6.95   7.73   22.0     1.45    -0.78  0.214270
11 2016-01-04   2.84   92  6.92   6.95   21.0     2.44    -0.03  0.961019

Line-by-line:

import csv
import scipy.stats as ss
results = []
treatment1 = ""
with open('R:/Python ETL/AGG7.csv') as f:
    reader = csv.reader(f)
    next(reader, None)
    for line in reader:
        treatment2, stdev2, n2, avg2 = line[0], float(line[2]), int(line[3]), float(line[4])
        if treatment2 == treatment1:
            ttest_res = ss.ttest_ind_from_stats(avg1, stdev1, n1, avg2, stdev2, n2, equal_var=False)
            results.append((avg2-avg1, ttest_res.pvalue))
        treatment1, stdev1, n1, avg1 = treatment2, stdev2, n2, avg2
Comments