piRSquared piRSquared - 5 months ago 28
Python Question

groupby and normalize over two arrays

I have a

DataFrame
where the columns are a
MultiIndex
. The first
level
specifies
'labels'
, the second specifies
'values'
. A
'label'
in the
(i, j)
position of
df.labels
corresponds to the
'value'
in the
(i, j)
position of
df.values
.

I want to rescale the
'values'
such that they sum to one within each group defined by the corresponding
'labels'
.

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.choice(('a', 'b', 'c', 'd'),
(10, 5), p=(.4, .3, .2, .1)))
df2 = pd.DataFrame((np.random.rand(10, 5) * 10).round(0))

df = pd.concat([df1, df2], axis=1, keys=['labels', 'values'])
print df

labels values
0 1 2 3 4 0 1 2 3 4
0 b b b b b 5.0 2.0 7.0 7.0 4.0
1 a c c c c 6.0 8.0 1.0 5.0 7.0
2 d c c d c 6.0 3.0 10.0 7.0 4.0
3 a a a b a 5.0 9.0 9.0 5.0 8.0
4 a b a c c 0.0 4.0 1.0 8.0 0.0
5 c b a a b 1.0 6.0 8.0 6.0 1.0
6 c c c a c 9.0 9.0 4.0 1.0 1.0
7 d c a b c 7.0 0.0 3.0 6.0 4.0
8 b a b a a 8.0 6.0 3.0 5.0 4.0
9 c c c b c 2.0 5.0 3.0 1.0 3.0


I expect the results to look like this:

labels values
0 1 2 3 4 0 1 2 3 4
0 b b b b b 0.084746 0.033898 0.118644 0.118644 0.067797
1 a c c c c 0.084507 0.091954 0.011494 0.057471 0.080460
2 d c c d c 0.300000 0.034483 0.114943 0.350000 0.045977
3 a a a b a 0.070423 0.126761 0.126761 0.084746 0.112676
4 a b a c c 0.000000 0.067797 0.014085 0.091954 0.000000
5 c b a a b 0.011494 0.101695 0.112676 0.084507 0.016949
6 c c c a c 0.103448 0.103448 0.045977 0.014085 0.011494
7 d c a b c 0.350000 0.000000 0.042254 0.101695 0.045977
8 b a b a a 0.135593 0.084507 0.050847 0.070423 0.056338
9 c c c b c 0.022989 0.057471 0.034483 0.016949 0.034483

Answer

To get the normalized values, you could:

new_values = pd.DataFrame(data=np.zeros(df['values'].shape))
for v in np.unique(df['labels']):
    mask = df['values'].where(df['labels'].isin([v]))
    new_values += mask.div(mask.sum().sum()).fillna(0)
df.loc[:, 'values'] = new_values.values

also as a somewhat unreadable oneliner:

df.loc[:, 'values'] = np.sum([df['values'].where(df['labels'].isin([v])).div(df['values'].where(df['labels'].isin([v])).sum().sum()).fillna(0).values for v in np.unique(df['labels'])], axis=0)

or, using .groupby():

tmp = pd.DataFrame(np.hstack((df['labels'].values.reshape(-1, 1), df['values'].values.reshape(-1, 1))))
df.loc[:, 'values'] = tmp.groupby(0).transform(lambda x: x/x.sum()).values.reshape(df['values'].shape)

both result in:

  labels                values                                        
       0  1  2  3  4         0         1         2         3         4
0      b  b  b  b  b  0.084746  0.033898  0.118644  0.118644  0.067797
1      a  c  c  c  c  0.084507  0.091954  0.011494  0.057471  0.080460
2      d  c  c  d  c  0.300000  0.034483  0.114943  0.350000  0.045977
3      a  a  a  b  a  0.070423  0.126761  0.126761  0.084746  0.112676
4      a  b  a  c  c  0.000000  0.067797  0.014085  0.091954  0.000000
5      c  b  a  a  b  0.011494  0.101695  0.112676  0.084507  0.016949
6      c  c  c  a  c  0.103448  0.103448  0.045977  0.014085  0.011494
7      d  c  a  b  c  0.350000  0.000000  0.042254  0.101695  0.045977
8      b  a  b  a  a  0.135593  0.084507  0.050847  0.070423  0.056338
9      c  c  c  b  c  0.022989  0.057471  0.034483  0.016949  0.034483
Comments