zsljulius zsljulius - 1 month ago 12
Python Question

How to calculate block averages in pandas DataFrame

Suppose, we have a DataFrame like so.

df = pd.DataFrame(np.random.randn(10,5), columns=['a','b','c','d','e'])
idx = np.random.randint(0,2,(10,5))
df = abs(df * idx)


In my specific case

a b c d e
0 0.000000 0.000000 0.000000 0.000000 0.976492
1 0.000000 3.589563 0.000000 0.400287 0.000000
2 0.000000 0.000000 0.247323 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.079501 0.000000
4 0.136868 0.000000 0.223572 0.153744 0.000000
5 0.000000 0.000000 0.494390 0.311590 0.000000
6 0.000000 0.000000 0.756897 0.000000 0.039769
7 0.000000 0.000000 0.000000 1.006561 0.000000
8 0.651060 0.588797 0.000000 0.000000 0.028943
9 1.040841 0.000000 0.000000 0.000000 0.879489


I would like to calculate the means of each non-zero blocks. For example, second to the last column (d column) would yield a series of the form:

0 0.000000
1 0.400287
2 0.000000
3 0.11662267564906763
4 0.11662267564906763
5 0.11662267564906763
6 0.000000
7 1.006561
8 0.000000
9 0.000000


Clarification:

What I meant by non-zero blocks is that for any given column, those blocks with consecutive non-zero entrys. For instance, column a would have 2 blocks, one occurs at index=4, value=0.136868, and the other occurs at index=8,9 and value = [0.651060,1040841]. I want the mean of each blocks.

Clarification 2:
In the case of column a, the output would be

0 0.000000
1 0.000000
2 0.000000
3 0.000000
4 0.136868
5 0.000000
6 0.000000
7 0.000000
8 0.8459504999999999
9 0.8459504999999999


where position 8,9 yields the average of the last two entries.

Answer

You can first get groups of consecutives values by DataFrame.shift with DataFrame.cumsum to df1:

a = df != 0
df1 = (a != a.shift()).cumsum()
print (df1)
   a  b  c  d  e
0  1  1  1  1  1
1  1  2  1  2  2
2  1  3  2  3  2
3  1  3  3  4  2
4  2  3  4  4  2
5  3  3  4  4  2
6  3  3  4  5  3
7  3  3  5  6  4
8  4  4  5  7  5
9  4  5  5  7  5

Then groupby by columns from df per groups from df1 with transform:

print (df.a.groupby([df1.a]).transform('mean'))
0    0.000000
1    0.000000
2    0.000000
3    0.000000
4    0.136868
5    0.000000
6    0.000000
7    0.000000
8    0.845951
9    0.845951
Name: a, dtype: float64

If need output of all columns use list comprehension with concat:

df2 = pd.concat([df[col].groupby([df1[col]]).transform('mean') for col in df], axis=1)
print (df2)
          a         b         c         d         e
0  0.000000  0.000000  0.000000  0.000000  0.976492
1  0.000000  3.589563  0.000000  0.400287  0.000000
2  0.000000  0.000000  0.247323  0.000000  0.000000
3  0.000000  0.000000  0.000000  0.181612  0.000000
4  0.136868  0.000000  0.491620  0.181612  0.000000
5  0.000000  0.000000  0.491620  0.181612  0.000000
6  0.000000  0.000000  0.491620  0.000000  0.039769
7  0.000000  0.000000  0.000000  1.006561  0.000000
8  0.845951  0.588797  0.000000  0.000000  0.454216
9  0.845951  0.000000  0.000000  0.000000  0.454216
Comments