B.Mr.W. B.Mr.W. - 1 year ago 81
Python Question

dplyr summarize equivalent in pandas

I used to work with R and really love the dplyr package which you can easily group by and summarize.

However, in pandas, I don't see an equivalent of summarize and here is how I achieve it in Python:

import pandas as pd
data = pd.DataFrame(
result = []
for k,v in data.groupby('col1'):
result.append([k, max(v['col2']), min(v['col3'])])
print pd.DataFrame(result, columns=['col1', 'col2_agg', 'col3_agg'])

It is not only very verbose, but also might not be the most optimized and efficient. (I used to rewrite a
for loop groupby
implementation into
and the performance enhancement was huge).

In R the code will be

data %>% groupby(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))

Is there an efficient equivalent in Python or for loop is what I have to work with.

also, @ayhan really gave a solution to my answer, this is a follow-up question that I will list here instead of the comment:

what is the equivalent of
groupby().summarize(newcolumn=max(col2 * col3))

Answer Source

The equivalent in pandas would be something like this:

data.groupby(['col1']).agg({'col2': max, 'col3': min})
      col2  col3
1        5    -5
2        9    -9

If you don't want col1 as an index, you can add .reset_index() at the end.

In general, you can pass multiple functions to agg:

data.groupby(['col1']).agg({'col2': [max, min], 'col3': [min, 'count']})
     col2     col3      
      max min  min count
1       5   1   -5     5
2       9   0   -9     5

Note that max and min are Python functions so you can use them directly. But for other functions you need to use aliases ('count', 'mean' etc. You can also use np.mean, np.var, ... These numpy functions will be translated to pandas functions.)

If you want to label each returning output you can use a dictionary of dictionaries:

data.groupby(['col1']).agg({'col2': {'col2_max': max, 'col2_min': min}, 
                            'col3': {'col3_min': min, 'col3_count': 'count'}})
         col2              col3           
     col2_max col2_min col3_min col3_count
1           5        1       -5          5
2           9        0       -9          5