ADS ADS - 1 month ago 6
Python Question

Groupby, editing groups and re-joining them efficiently in Pandas

In pandas, I have been looking for a general flow to group a dataframe by a certain column, perform non-trivial operations on the groups, and then reassemble the groups into one big dataframe (effectively by stacking them on top of each other).

Imagine I have a DataFrame df:

+----+-------+---+---+---+
| | A | B | C | D |
+----+-------+---+---+---+
| 0 | Green | 1 | 4 | 5 |
| 1 | Red | 2 | 3 | 2 |
| 2 | Red | 1 | 4 | 3 |
| 3 | Green | 2 | 2 | 2 |
| 4 | Green | 1 | 1 | 1 |
| 5 | Blue | 2 | 1 | 5 |
| 6 | Red | 2 | 1 | 6 |
| 7 | Blue | 7 | 8 | 9 |
| 8 | Green | 7 | 6 | 5 |
| 9 | Red | 0 | 9 | 0 |
| 10 | Blue | 4 | 5 | 4 |
+----+-------+---+---+---+


I would like to groupby() column A and then perform an operation on each group. Typically this operation involves creating new rows by comparing the value in one row with the value in another row, for all rows, so I wouldn't say it could be done with a lambda function applied to the groups. Then I want to put these groups back together into a dataframe, effectively in the same format as above but with the inserted rows.

My general approach so far has been to do it the "slow and stupid" way, i.e.:

group_list = []

g = df.groupby("A")
for i, group in g:
    # Perform some weird operation on group that can't really be reduced to a
    # lambda function applied to each group.
    group_list.append(group)

# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.
reconstituted = group_list[0]
for i in range(1, len(group_list)):
    reconstituted = reconstituted.append(group_list[i], ignore_index=True)


Clearly this isn't particularly pandas-esque, so that is my question - what is a better way of operating on the groups themselves and then reconstituting them?

Answer Source

Without knowing what your function does, if all you want to do is join the groups back together, you can use pd.concat:

df_new = pd.concat(group_list)

MCVE:

In [77]: df1
Out[77]: 
   0
0  a
1  b

In [78]: df2
Out[78]: 
   0
0  c
1  d

In [79]: pd.concat([df1, df2], ignore_index=True)
Out[79]: 
   0
0  a
1  b
2  c
3  d
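
Applied to the loop in the question, the same idea might look like the sketch below. The add_total_row function is a hypothetical stand-in for the real per-group operation, which the question doesn't specify; here it simply appends one row of column sums to each group, so treat it as an illustration rather than the actual logic.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "A": ["Green", "Red", "Red", "Green", "Green", "Blue",
          "Red", "Blue", "Green", "Red", "Blue"],
    "B": [1, 2, 1, 2, 1, 2, 2, 7, 7, 0, 4],
    "C": [4, 3, 4, 2, 1, 1, 1, 8, 6, 9, 5],
    "D": [5, 2, 3, 2, 1, 5, 6, 9, 5, 0, 4],
})


def add_total_row(group):
    # Hypothetical placeholder for the real per-group operation:
    # append one extra row holding the sums of B, C and D for this group.
    extra = pd.DataFrame({
        "A": [group["A"].iloc[0]],
        "B": [group["B"].sum()],
        "C": [group["C"].sum()],
        "D": [group["D"].sum()],
    })
    return pd.concat([group, extra], ignore_index=True)


# Process every group, then stack the results with a single concat call
# instead of growing a frame one append at a time.
pieces = [add_total_row(group) for _, group in df.groupby("A")]
reconstituted = pd.concat(pieces, ignore_index=True)
```

Calling pd.concat once on the whole list avoids copying the accumulated frame on every iteration, which is what makes the repeated-append loop slow.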

However, I would urge you to consider a different technique that doesn't involve explicitly splitting the groups and working on them separately; looping over groups in Python like that is very inefficient.
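
As a rough illustration of what that could look like, suppose the per-group step is a comparison of each row with the previous row in the same group. That is an assumption, since the question doesn't say what the operation is. In that case groupby(...).diff() computes the comparison for all groups at once, without splitting the frame, and the rows keep their original order. The column names B_diff and B_increased are made up for this sketch.

```python
import pandas as pd

# A small subset of the frame from the question.
df = pd.DataFrame({
    "A": ["Green", "Red", "Red", "Green", "Green", "Blue"],
    "B": [1, 2, 1, 2, 1, 2],
})

# Difference between each row's B and the previous row's B within the same
# group of A. The first row of each group has no predecessor, so it gets NaN.
df["B_diff"] = df.groupby("A")["B"].diff()

# Hypothetical follow-up step: flag rows where B increased relative to the
# previous row of the same group (NaN compares as False).
df["B_increased"] = df["B_diff"] > 0
```

Whether this applies depends on the real operation; if it genuinely needs arbitrary Python per group, collecting the pieces in a list and calling pd.concat once is still the simpler fix.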