ibarant - 2 years ago 843

Python Question

This is my first posted question, so please excuse if it doesn't look good.

I have a source data file which I transform to the following dataframe using pandas groupby aggregation

`pd.read_csv('R:/Python ETL/AGG7.csv', sep=',')`

Treatment Month stdev n avg

0 AAAA 1/1/2016 1.92 309 7.57

1 AAAA 2/1/2016 1.89 79 7.46

2 AAAA 3/1/2016 2.25 158 7.20

3 AAAA 4/1/2016 2.23 22 7.68

4 BBBB 1/1/2016 2.04 175 7.10

5 BBBB 2/1/2016 1.96 33 7.09

6 BBBB 3/1/2016 2.02 110 7.32

7 BBBB 4/1/2016 1.73 25 7.92

8 CCCC 1/1/2016 2.42 111 7.40

9 CCCC 2/1/2016 1.45 22 7.73

10 CCCC 3/1/2016 2.44 21 6.95

11 CCCC 4/1/2016 2.84 92 6.92

What I need is 2 additional columns with month over month difference (MoM diff) and p-value of T-tests of those differences.

`MoM diff pValue`

-0.11 0.35

-0.26 0.62

0.48 0.65

-0.01 0.02

0.23 0.44

0.6 0.83

0.33 0.46

-0.78 0.79

-0.03 0.04

The problem is that I cannot get them on the fly using pandas group by with scipy.stats ttest_ind function from original dataset and ttest_ind_from_stats function from the shown aggregated dataframe. I tried many different approaches, but with no success. Can anyone help, please?

Recommended for you: Get network issues from **WhatsUp Gold**. **Not end users.**

Answer Source

You can use df.shift with groupby to have the shifted values:

```
df[["avg_2", "n_2", "stdev_2"]] = df.groupby("Treatment")["avg", "n", "stdev"].shift()
df
Out[7]:
Treatment Month stdev n avg avg_2 n_2 stdev_2
0 AAAA 2016-01-01 1.92 309 7.57 NaN NaN NaN
1 AAAA 2016-01-02 1.89 79 7.46 7.57 309.0 1.92
2 AAAA 2016-01-03 2.25 158 7.20 7.46 79.0 1.89
3 AAAA 2016-01-04 2.23 22 7.68 7.20 158.0 2.25
4 BBBB 2016-01-01 2.04 175 7.10 NaN NaN NaN
5 BBBB 2016-01-02 1.96 33 7.09 7.10 175.0 2.04
6 BBBB 2016-01-03 2.02 110 7.32 7.09 33.0 1.96
7 BBBB 2016-01-04 1.73 25 7.92 7.32 110.0 2.02
8 CCCC 2016-01-01 2.42 111 7.40 NaN NaN NaN
9 CCCC 2016-01-02 1.45 22 7.73 7.40 111.0 2.42
10 CCCC 2016-01-03 2.44 21 6.95 7.73 22.0 1.45
11 CCCC 2016-01-04 2.84 92 6.92 6.95 21.0 2.44
```

You can filter out `NaN`

values with pd.notnull:

```
df2 = df[pd.notnull(df.avg_2)].copy()
```

And you can get the results of the t-tests with:

```
import scipy.stats as ss
res = ss.ttest_ind_from_stats(df2.avg, df2.stdev, df2.n, df2.avg_2, df2.stdev_2, df2.n_2, equal_var=False)
```

If you want the mean differences and p-values in this dataframe:

```
df2["dif_avg"] = df2.avg - df2.avg_2
df2["p_value"] = res.pvalue
Out[22]:
Month stdev n avg avg_2 n_2 stdev_2 dif_avg p_value
1 2016-01-02 1.89 79 7.46 7.57 309.0 1.92 -0.11 0.646226
2 2016-01-03 2.25 158 7.20 7.46 79.0 1.89 -0.26 0.350814
3 2016-01-04 2.23 22 7.68 7.20 158.0 2.25 0.48 0.353023
5 2016-01-02 1.96 33 7.09 7.10 175.0 2.04 -0.01 0.978808
6 2016-01-03 2.02 110 7.32 7.09 33.0 1.96 0.23 0.559625
7 2016-01-04 1.73 25 7.92 7.32 110.0 2.02 0.60 0.137527
9 2016-01-02 1.45 22 7.73 7.40 111.0 2.42 0.33 0.395806
10 2016-01-03 2.44 21 6.95 7.73 22.0 1.45 -0.78 0.214270
11 2016-01-04 2.84 92 6.92 6.95 21.0 2.44 -0.03 0.961019
```

Line-by-line:

```
import csv
import scipy.stats as ss
results = []
treatment1 = ""
with open('R:/Python ETL/AGG7.csv') as f:
reader = csv.reader(f)
next(reader, None)
for line in reader:
treatment2, stdev2, n2, avg2 = line[0], float(line[2]), int(line[3]), float(line[4])
if treatment2 == treatment1:
ttest_res = ss.ttest_ind_from_stats(avg1, stdev1, n1, avg2, stdev2, n2, equal_var=False)
results.append((avg2-avg1, ttest_res.pvalue))
treatment1, stdev1, n1, avg1 = treatment2, stdev2, n2, avg2
```

Recommended from our users: **Dynamic Network Monitoring from WhatsUp Gold from IPSwitch**. ** Free Download**