river_jones river_jones - 18 days ago 11
Python Question

How to use a custom pandas groupby aggregation function to combine rows in a dataframe

I have a dataframe with a

name
column and a
department
column. There are repeats in the
name
column that have different
department
values but all other column values are identical. I'd like to flatten these repeats into a single row and combine the different (unique) department values into a list. So, take first row of each group and just change the
department
value to a list of the unique
department
values in that group. So resulting dataframe should have exact same columns but no repeats in
name
column and
department
column now has lists of at least one element.

I thought to use
groupby
and a custom aggregation function passed to
agg()
but the following just totally fails. My thinking was that my aggregation function would get each group as a dataframe and if for each dataframe group I returned a series then the output of
groupby.agg(flatten_departments)
would be a dataframe.

def flatten_departments(name_group):
#I thought name_group would be a df of that group

#this group is length 1 so this name doesn't actually repeat so just return same row
if len(name_group) == 1:
return name_group.squeeze() #turn length-1 df into a series to return, don't worry that department is a string and not a list for now
else:
#treat name_group like a df and get the unique departments
departments = list(name_group['department'].unique())
name_ser = name_group.iloc[0,:] #take first "row" of this group
name_ser['department'] = departments #replace department value with list of unique values from group

return name_ser

my_df = my_df.groupby(['name']).agg(flatten_departments)


This was a disaster and
name_group
is not a df but a series whose index is an index from the original df, and name is the name of some other column in the original df and value the value for that column.

I know that I could just do a for loop over the
groupby
object as follows

list_of_ser = []
for name, gp in my_df.groupby(['name']):
if len(gp) == 1:
list_of_ser.append(gp.squeeze())
else:
new_ser = gp.iloc[0,:]
new_ser['department'] = list(gp['department'].unique())
list_of_ser.append(new_ser)

new_df = pd.DataFrame(list_of_ser, columns=my_df.columns)


but I just thought that was the point of
agg
!

Any ideas how to accomplish my goal with
agg
or if the for loop is really the correct way. If the for loop is the right way, what is the point of
agg
?

Thank you!

Answer
df = pd.DataFrame(
    dict(
        name=list('ABCDEFGACEF'),
        dept=list('xyxyzxyzyxz')
    )
)

df.groupby('name').dept.apply(list).reset_index()

enter image description here


agg could have been used like this

df.groupby('name').dept.agg(dict(dept=lambda x: list(x))).reset_index()