cancerconnector - 1 year ago 116
Python Question

# Is there a simple way to do all pairwise statistical comparisons of distributions from a pandas dataframe?

I have a pandas dataframe with 5 distributions in it. I am able to slice it up and run pairwise comparisons using ranksums or equivalent, like this:

``````from scipy import stats

# Combine boolean masks with & (each condition in parentheses); `and` raises a ValueError on Series
case1 = df[(df['SymmetricDivisionRate']=='1') & (df['test']=='sackin')]['value']
case2 = df[(df['SymmetricDivisionRate']=='0.8') & (df['test']=='sackin')]['value']
case3 = df[(df['SymmetricDivisionRate']=='0.6') & (df['test']=='sackin')]['value']
case4 = df[(df['SymmetricDivisionRate']=='0.4') & (df['test']=='sackin')]['value']
case5 = df[(df['SymmetricDivisionRate']=='0.2') & (df['test']=='sackin')]['value']

z_stat_12, p_val_12 = stats.ranksums(case1, case2)
z_stat_13, p_val_13 = stats.ranksums(case1, case3)
z_stat_14, p_val_14 = stats.ranksums(case1, case4)
z_stat_15, p_val_15 = stats.ranksums(case1, case5)
z_stat_23, p_val_23 = stats.ranksums(case2, case3)
z_stat_24, p_val_24 = stats.ranksums(case2, case4)
z_stat_25, p_val_25 = stats.ranksums(case2, case5)
z_stat_34, p_val_34 = stats.ranksums(case3, case4)
z_stat_35, p_val_35 = stats.ranksums(case3, case5)
z_stat_45, p_val_45 = stats.ranksums(case4, case5)
``````

and I get the numbers I want, but this seems horribly un-Pythonic and I'm sure there is a simpler way using pandas.

Here is an example dataset (I've never posted data before, sorry if this is unwieldy).

``````    SymmetricDivisionRate   iteration   test    value
0   1   1   B1  205.0345238
1   1   1   Nbar    3.24545051
2   1   1   sackin  7312
3   1   1   sackin_yule -11.34946052
4   1   1   sackin_pda  0.068374536
5   1   2   B1  216.1595238
6   1   2   Nbar    3.182567216
7   1   2   sackin  7339
8   1   2   sackin_yule -11.45883714
9   1   2   sackin_pda  0.066274725
10  1   3   B1  209.1
11  1   3   Nbar    3.110472824
12  1   3   sackin  7039
13  1   3   sackin_yule -11.49329366
14  1   3   sackin_pda  0.065385904
15  1   4   B1  209.5678571
16  1   4   Nbar    3.215731371
17  1   4   sackin  6991
18  1   4   sackin_yule -11.30780804
19  1   4   sackin_pda  0.068968375
20  1   5   B1  218.1789683
21  1   5   Nbar    3.248949089
22  1   5   sackin  6956
23  1   5   sackin_yule -11.24400585
24  1   5   sackin_pda  0.070215755
25  0.8 1   B1  109.5333333
26  0.8 1   Nbar    2.789264414
27  0.8 1   sackin  4209
28  0.8 1   sackin_yule -11.00423445
29  0.8 1   sackin_pda  0.071803409
30  0.8 2   B1  137.5761905
31  0.8 2   Nbar    3.071715818
32  0.8 2   sackin  4583
33  0.8 2   sackin_yule -10.69913124
34  0.8 2   sackin_pda  0.079523708
35  0.8 3   B1  125.0428571
36  0.8 3   Nbar    3.630173565
37  0.8 3   sackin  5438
38  0.8 3   sackin_yule -10.14869758
39  0.8 3   sackin_pda  0.093793228
40  0.8 4   B1  119.45
41  0.8 4   Nbar    3.045751634
42  0.8 4   sackin  4660
43  0.8 4   sackin_yule -10.77537925
44  0.8 4   sackin_pda  0.077866162
45  0.8 5   B1  134.9511905
46  0.8 5   Nbar    3.507385999
47  0.8 5   sackin  5461
48  0.8 5   sackin_yule -10.34871987
49  0.8 5   sackin_pda  0.088887207
50  0.6 1   B1  113.6456349
51  0.6 1   Nbar    3.610369207
52  0.6 1   sackin  4596
53  0.6 1   sackin_yule -9.843110763
54  0.6 1   sackin_pda  0.101189958
55  0.6 2   B1  112.5384921
56  0.6 2   Nbar    4.176514032
57  0.6 2   sackin  5655
58  0.6 2   sackin_yule -9.400292666
59  0.6 2   sackin_pda  0.113502287
60  0.6 3   B1  109.9595238
61  0.6 3   Nbar    3.630434783
62  0.6 3   sackin  4843
63  0.6 3   sackin_yule -9.916620532
64  0.6 3   sackin_pda  0.099398705
65  0.6 4   B1  104.0289683
66  0.6 4   Nbar    4.133131619
67  0.6 4   sackin  5464
68  0.6 4   sackin_yule -9.395858086
69  0.6 4   sackin_pda  0.113674619
70  0.6 5   B1  98.8
71  0.6 5   Nbar    3.447641886
72  0.6 5   sackin  4313
73  0.6 5   sackin_yule -9.970985718
74  0.6 5   sackin_pda  0.097475056
75  0.4 1   B1  107.3107143
76  0.4 1   Nbar    3.649173955
77  0.4 1   sackin  3755
78  0.4 1   sackin_yule -9.378914506
79  0.4 1   sackin_pda  0.113759292
80  0.4 2   B1  105.1011905
81  0.4 2   Nbar    3.51625239
82  0.4 2   sackin  3678
83  0.4 2   sackin_yule -9.5445921
84  0.4 2   sackin_pda  0.10872119
85  0.4 3   B1  97.53452381
86  0.4 3   Nbar    3.655306719
87  0.4 3   sackin  3754
88  0.4 3   sackin_yule -9.368892583
89  0.4 3   sackin_pda  0.114061375
90  0.4 4   B1  98.34285714
91  0.4 4   Nbar    3.333010649
92  0.4 4   sackin  3443
93  0.4 4   sackin_yule -9.702833517
94  0.4 4   sackin_pda  0.103701859
95  0.4 5   B1  115.8261905
96  0.4 5   Nbar    3.275482094
97  0.4 5   sackin  3567
98  0.4 5   sackin_yule -9.865897615
99  0.4 5   sackin_pda  0.099257033
100 0.2 1   B1  90.50119048
101 0.2 1   Nbar    3.901939655
102 0.2 1   sackin  3621
103 0.2 1   sackin_yule -8.919632533
104 0.2 1   sackin_pda  0.128087444
105 0.2 2   B1  87.61666667
106 0.2 2   Nbar    3.126728111
107 0.2 2   sackin  2714
108 0.2 2   sackin_yule -9.561238501
109 0.2 2   sackin_pda  0.106128067
110 0.2 3   B1  87.70952381
111 0.2 3   Nbar    3.72
112 0.2 3   sackin  3162
113 0.2 3   sackin_yule -8.926080269
114 0.2 3   sackin_pda  0.127594947
115 0.2 4   B1  88.03333333
116 0.2 4   Nbar    3.089449541
117 0.2 4   sackin  2694
118 0.2 4   sackin_yule -9.607707206
119 0.2 4   sackin_pda  0.104621963
120 0.2 5   B1  89.45
121 0.2 5   Nbar    3.711306257
122 0.2 5   sackin  3381
123 0.2 5   sackin_yule -9.073308361
124 0.2 5   sackin_pda  0.122961062
``````

You could do something along these lines:

``````from itertools import combinations

import pandas as pd
from scipy.stats import ranksums
``````

``````df.SymmetricDivisionRate = df.SymmetricDivisionRate.astype(str)
selectors = df.SymmetricDivisionRate.unique()
``````
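One detail of the conversion above is easy to miss: if the column was parsed as floats, `astype(str)` renders `1` as `'1.0'`, which is why the selectors further down read `'1.0'` rather than `'1'`. A minimal sketch, assuming a float column; the Series `rates` is a made-up stand-in for `df.SymmetricDivisionRate`:

```python
import pandas as pd

# Hypothetical stand-in for df.SymmetricDivisionRate, parsed as floats.
rates = pd.Series([1, 1, 0.8, 0.6, 0.4, 0.2])

# Converting to strings keeps the float formatting, so 1 becomes '1.0'.
as_text = rates.astype(str)

# unique() preserves order of first appearance.
selectors = list(as_text.unique())
print(selectors)
```

If the column was read as strings in the first place (for example `'1'`), the conversion leaves the labels unchanged, so it is worth checking `df.dtypes` before relying on a particular spelling.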

Create all `combinations` of the selectors:

``````cases = list(combinations(selectors, 2))  # a list, so the pairs can be inspected and reused

# cases now holds:
[('1.0', '0.8'), ('1.0', '0.6'), ('1.0', '0.4'), ('1.0', '0.2'), ('0.8', '0.6'), ('0.8', '0.4'), ('0.8', '0.2'), ('0.6', '0.4'), ('0.6', '0.2'), ('0.4', '0.2')]
``````
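Note that `combinations` on its own returns a lazy, one-shot iterator: once it has been printed or looped over, it is empty. The sketch below illustrates this with the selector labels from above typed in by hand; the names `lazy_cases`, `first_pass` and `second_pass` are only for illustration.

```python
from itertools import combinations

# Selector labels typed in by hand for this illustration.
selectors = ['1.0', '0.8', '0.6', '0.4', '0.2']

lazy_cases = combinations(selectors, 2)
first_pass = list(lazy_cases)    # consumes the iterator: all 10 pairs
second_pass = list(lazy_cases)   # nothing left: empty list

# Materializing the pairs once makes them safe to print and reuse.
cases = list(combinations(selectors, 2))
print(len(first_pass), len(second_pass), len(cases))
```

For n groups there are n(n-1)/2 unordered pairs, so 5 groups give the 10 comparisons written out by hand in the question.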

Keep the related data in a `dictionary` (optional: you could select the data on the fly in the next step, but the code would become a little less readable):

``````means = {s: df.loc[(df['SymmetricDivisionRate']==s) & (df.test=='sackin'), 'value'] for s in selectors}
``````
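As an alternative to one boolean mask per selector, the same lookup could probably be built in a single pass with `groupby`. This is a sketch on a small made-up frame that only mimics the column names of the question's data; `sackin` and `groups` are illustrative names.

```python
import pandas as pd

# Small made-up frame with the same column names as the question's data.
df = pd.DataFrame({
    'SymmetricDivisionRate': ['1.0', '1.0', '0.8', '0.8', '1.0', '0.8'],
    'test': ['sackin', 'sackin', 'sackin', 'sackin', 'B1', 'B1'],
    'value': [7312.0, 7339.0, 4209.0, 4583.0, 205.03, 109.53],
})

# Filter to the test of interest once, then split the values by rate.
sackin = df[df['test'] == 'sackin']
groups = {
    rate: values
    for rate, values in sackin.groupby('SymmetricDivisionRate')['value']
}
print({rate: values.tolist() for rate, values in groups.items()})
```

The resulting dictionary has the same shape as the one above (label to Series of values), so it can be used in the next step unchanged.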

Calculate the `ranksums` using a `dictionary comprehension` (which you could then convert to a `pd.DataFrame`):

``````results = pd.DataFrame({c: ranksums(means[c[0]], means[c[1]]) for c in cases}).T
results.columns = ['z_stat', 'p_val']
``````

to get:

``````           z_stat     p_val
0.4 0.2  2.193378  0.028280
0.6 0.2  2.611165  0.009023
    0.4  2.611165  0.009023
0.8 0.2  2.611165  0.009023
    0.4  2.611165  0.009023
    0.6 -0.731126  0.464702
1.0 0.2  2.611165  0.009023
    0.4  2.611165  0.009023
    0.6  2.611165  0.009023
    0.8  2.611165  0.009023
``````
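For reference, the steps above can be folded into one small helper. This is a sketch rather than a definitive implementation: the function name `pairwise_ranksums` and the index names `group_a` and `group_b` are made up for illustration, and the `sackin` values are copied by hand from the example dataset in the question.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import ranksums


def pairwise_ranksums(groups):
    """Run ranksums on every unordered pair of entries in `groups`.

    `groups` maps a label to a sequence of observations. The result is a
    DataFrame indexed by (group_a, group_b) with columns z_stat and p_val.
    """
    pairs = list(combinations(groups, 2))
    rows = []
    for a, b in pairs:
        z_stat, p_val = ranksums(groups[a], groups[b])
        rows.append((z_stat, p_val))
    index = pd.MultiIndex.from_tuples(pairs, names=['group_a', 'group_b'])
    return pd.DataFrame(rows, index=index, columns=['z_stat', 'p_val'])


# 'sackin' values copied from the example dataset in the question.
sackin = {
    '1.0': [7312, 7339, 7039, 6991, 6956],
    '0.8': [4209, 4583, 5438, 4660, 5461],
    '0.6': [4596, 5655, 4843, 5464, 4313],
    '0.4': [3755, 3678, 3754, 3443, 3567],
    '0.2': [3621, 2714, 3162, 2694, 3381],
}

results = pairwise_ranksums(sackin)
print(results)
```

Because the helper takes a plain label-to-values mapping, it can be fed either the dictionary built above or one derived from any other grouping of the dataframe.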