zsljulius - 1 month ago 12
Python Question

# How to calculate block averages in pandas DataFrame

Suppose, we have a DataFrame like so.

``````df = pd.DataFrame(np.random.randn(10,5), columns=['a','b','c','d','e'])
idx = np.random.randint(0,2,(10,5))
df = abs(df * idx)
``````

In my specific case

``````    a   b   c   d   e
0   0.000000    0.000000    0.000000    0.000000    0.976492
1   0.000000    3.589563    0.000000    0.400287    0.000000
2   0.000000    0.000000    0.247323    0.000000    0.000000
3   0.000000    0.000000    0.000000    0.079501    0.000000
4   0.136868    0.000000    0.223572    0.153744    0.000000
5   0.000000    0.000000    0.494390    0.311590    0.000000
6   0.000000    0.000000    0.756897    0.000000    0.039769
7   0.000000    0.000000    0.000000    1.006561    0.000000
8   0.651060    0.588797    0.000000    0.000000    0.028943
9   1.040841    0.000000    0.000000    0.000000    0.879489
``````

I would like to calculate the means of each non-zero blocks. For example, second to the last column (d column) would yield a series of the form:

``````0    0.000000
1    0.400287
2    0.000000
3    0.11662267564906763
4    0.11662267564906763
5    0.11662267564906763
6    0.000000
7    1.006561
8    0.000000
9    0.000000
``````

Clarification:

What I meant by non-zero blocks is that for any given column, those blocks with consecutive non-zero entrys. For instance, column a would have 2 blocks, one occurs at index=4, value=0.136868, and the other occurs at index=8,9 and value = [0.651060,1040841]. I want the mean of each blocks.

Clarification 2:
In the case of column a, the output would be

``````0    0.000000
1    0.000000
2    0.000000
3    0.000000
4    0.136868
5    0.000000
6    0.000000
7    0.000000
8    0.8459504999999999
9    0.8459504999999999
``````

where position 8,9 yields the average of the last two entries.

You can first get groups of consecutives values by `DataFrame.shift` with `DataFrame.cumsum` to `df1`:

``````a = df != 0
df1 = (a != a.shift()).cumsum()
print (df1)
a  b  c  d  e
0  1  1  1  1  1
1  1  2  1  2  2
2  1  3  2  3  2
3  1  3  3  4  2
4  2  3  4  4  2
5  3  3  4  4  2
6  3  3  4  5  3
7  3  3  5  6  4
8  4  4  5  7  5
9  4  5  5  7  5
``````

Then `groupby` by columns from `df` per groups from `df1` with `transform`:

``````print (df.a.groupby([df1.a]).transform('mean'))
0    0.000000
1    0.000000
2    0.000000
3    0.000000
4    0.136868
5    0.000000
6    0.000000
7    0.000000
8    0.845951
9    0.845951
Name: a, dtype: float64
``````

If need output of all columns use list comprehension with `concat`:

``````df2 = pd.concat([df[col].groupby([df1[col]]).transform('mean') for col in df], axis=1)
print (df2)
a         b         c         d         e
0  0.000000  0.000000  0.000000  0.000000  0.976492
1  0.000000  3.589563  0.000000  0.400287  0.000000
2  0.000000  0.000000  0.247323  0.000000  0.000000
3  0.000000  0.000000  0.000000  0.181612  0.000000
4  0.136868  0.000000  0.491620  0.181612  0.000000
5  0.000000  0.000000  0.491620  0.181612  0.000000
6  0.000000  0.000000  0.491620  0.000000  0.039769
7  0.000000  0.000000  0.000000  1.006561  0.000000
8  0.845951  0.588797  0.000000  0.000000  0.454216
9  0.845951  0.000000  0.000000  0.000000  0.454216
``````