Salvador Dali - 3 months ago
Python Question

Average of multiple dataframes with the same columns and indices

I have a few dataframes. Each of them has the same columns and the same indices. For each index I want to average the values in each column (if these were matrices, I would just sum them up and divide by the number of matrices).

Here is an example:

import pandas as pd

v1 = pd.DataFrame([['ind1', 1, 2, 3], ['ind2', 4, 5, 6]], columns=['id', 'c1', 'c2', 'c3']).set_index('id')
v2 = pd.DataFrame([['ind1', 2, 3, 4], ['ind2', 6, 1, 2]], columns=['id', 'c1', 'c2', 'c3']).set_index('id')
v3 = pd.DataFrame([['ind1', 1, 2, 1], ['ind2', 1, 1, 3]], columns=['id', 'c1', 'c2', 'c3']).set_index('id')


In a real situation, the indices and columns can appear in a different order from one dataframe to the next (see the sketch below).
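For instance (a purely illustrative sketch, not part of the original data), v3 might just as well arrive with its rows and columns reordered:

v3_shuffled = pd.DataFrame(
    [['ind2', 3, 1, 1], ['ind1', 1, 1, 2]],   # same values as v3, rows swapped
    columns=['id', 'c3', 'c1', 'c2'],         # columns in a different order
).set_index('id')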

For this example, the result will be:

            c1        c2        c3
id                                
ind1  1.333333  2.333333  2.666667
ind2  3.666667  2.333333  3.666667

(The value for ind1, c1 is (1 + 1 + 2) / 3, the value for ind2, c2 is (1 + 5 + 1) / 3, and so on.)

Currently I do this with loops:

dfs = [v1, v2, v3]
cols = ['c1', 'c2', 'c3']

data = []
for ind, _ in dfs[0].iterrows():
    # average each column across all frames, one index at a time
    vals = [sum(df.loc[ind][col] for df in dfs) / float(len(dfs)) for col in cols]
    data.append([ind] + vals)

pd.DataFrame(data, columns=['id'] + cols).set_index('id')


This works, but it is clearly inefficient for big dataframes with many columns. How can I achieve this without loops?

Answer

You can use groupby.mean on the index level after concatenating the dataframes:

pd.concat([v1, v2, v3]).groupby(level=0).mean()

            c1        c2        c3
id                                
ind1  1.333333  2.333333  2.666667
ind2  3.666667  2.333333  3.666667
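
As a quick sanity check (using the hypothetical v3_shuffled frame sketched in the question above): concat aligns the columns by label and groupby(level=0) collects the rows by index label, so the row and column ordering inside the individual frames does not affect the averages.

# hypothetical check with the shuffled copy of v3 defined earlier
print(pd.concat([v1, v2, v3_shuffled]).groupby(level=0).mean())   # same values as above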