Ger - 1 year ago 79

Python Question

My question is about groupby operation with pandas. I have the following DataFrame :

`In [4]: df = pd.DataFrame({"A": range(4), "B": ["PO", "PO", "PA", "PA"], "C": ["Est", "Est", "West", "West"]})`

In [5]: df

Out[5]:

A B C

0 0 PO Est

1 1 PO Est

2 2 PA West

3 3 PA West

This is what I would like to do : I want to group by column B and do a sum on column A. But at the end, I would like column C to still be in the DataFrame. If I do :

`In [8]: df.groupby(by="B").aggregate(pd.np.sum)`

Out[8]:

A

B

PA 5

PO 1

It does the job but column C is missing. I can also do this :

`In [9]: df.groupby(by=["B", "C"]).aggregate(pd.np.sum)`

Out[9]:

A

B C

PA West 5

PO Est 1

or

`In [11]: df.groupby(by=["B", "C"], as_index=False).aggregate(pd.np.sum)`

Out[11]:

B C A

0 PA West 5

1 PO Est 1

But in both cases it group by B AND C and not just B and keeps the C value. Is what I want to do irrelevant or is there a way to do it ?

Answer Source

try to use DataFrameGroupBy.agg() method with `dict of {column -> function}`

:

```
In [6]: df.groupby('B').agg({'A':'sum', 'C':'first'})
Out[6]:
C A
B
PA West 5
PO Est 1
```

From docs:

Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If passed a dict, the keys must be DataFrame column names.

or something like this depending on your goals:

```
In [8]: df = pd.DataFrame({"A": range(4), "B": ["PO", "PO", "PA", "PA"], "C": ["Est1", "Est2", "West1", "West2"]})
In [9]: df.groupby('B').agg({'A':'sum', 'C':'first'})
Out[9]:
C A
B
PA West1 5
PO Est1 1
In [10]: df['sum_A'] = df.groupby('B')['A'].transform('sum')
In [11]: df
Out[11]:
A B C sum_A
0 0 PO Est1 1
1 1 PO Est2 1
2 2 PA West1 5
3 3 PA West2 5
```