johnbaltis johnbaltis - 28 days ago 10
Python Question

pandas, apply multiple functions of multiple columns to groupby object

I want to apply multiple functions of multiple columns to a groupby object which results in a new

pandas.DataFrame
.

I know how to do it in seperate steps:

by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))


Which results in
user_df
being:
user_df

However I suspect that there is a better way, like:

by_user.agg({'elapsed_days': lambda x: (x.elapsed_time * x.num_cores).sum() / 86400,
'running_days': lambda x: (x.running_time * x.num_cores).sum() / 86400})


However, this doesn't work, because AFAIK
agg()
works on
pandas.Series
.

I did find this question and answer, but the solutions look rather ugly to me, and considering that the answer is nearly four years old, there might be a better way by now.

Answer

I think you can avoid agg or apply and rather first multiple by mul, then div and last use groupby by index with aggregating sum:

lasts = pd.DataFrame({'user':['a','s','d','d'],
                   'elapsed_time':[40000,50000,60000,90000],
                   'running_time':[30000,20000,30000,15000],
                   'num_cores':[7,8,9,4]})

print (lasts)
   elapsed_time  num_cores  running_time user
0         40000          7         30000    a
1         50000          8         20000    s
2         60000          9         30000    d
3         90000          4         15000    d
by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
print (elapsed_days)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))
print (user_df)
      elapsed_days  running_days
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852
lasts = lasts.set_index('user')
print (lasts[['elapsed_time','running_time']].mul(lasts['num_cores'], axis=0)
                                             .div(86400)
                                             .groupby(level=0)
                                             .sum())
      elapsed_time  running_time
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852