Ryan Erwin Ryan Erwin - 7 months ago 13
Python Question

Assign new value in DataFrame column based on group max

I have a data frame and I'd like to add a new column based the value in two columns, where one uses the group max. For example, here's an example

DataFrame


import pandas as pd
import numpy as np


df = pd.DataFrame({"col1": ['a'] * 2 + ['b'] * 4,
"col2": [1, 2, 1, 2, 3, 4],
"col3": ['yep', 'no', 'no', 'no', 'yep', 'yep']})


df
col1 col2 col3
0 a 1 yep
1 a 2 no
2 b 1 no
3 b 2 no
4 b 3 yep
5 b 4 yep


Now, what I'd like is a new column that said "yes" where
col2
is the group max, grouping on
col1
and
col3
is equal to "yep".

col1 col2 col3 col4
0 a 1 yep no
1 a 2 no no
2 b 1 no no
3 b 2 no no
4 b 3 yep no
5 b 4 yep yes


I've tried the code below, but I'm getting a slight error on the return value:

# define function
def max_bal(df):
max_row = df['col2'].max()
label = np.where((df['col3'] == 'yep') &
(df['col2'] == max_row),
'Yes',
'No')
return label


Now call function in a group by statement

df.groupby('col1').apply(max_bal)

col1
a [No, No]
b [No, No, No, Yes]
dtype: object


These are the correct values but not returned in a
Series
as I would expect. I'd need one
Series
/
array
/
list
because I'm trying to assign the these values to a new column in the
df
above (see `col4'). Any pointers would be greatly appreciated.

Answer

You can return labels as a DataFrame with the same index as the group the function is being applied on.

def max_bal(df):
    max_row = df['col2'].max()
    labels = np.where((df['col3'] == 'yep') & 
                     (df['col2'] == max_row),
                     'Yes', 
                     'No')
    return pd.DataFrame(labels, index=df.index)

df['col4'] = df.groupby('col1').apply(max_bal)