JonathanBechtel JonathanBechtel - 2 months ago 18
Python Question

Applying a custom function on a Pandas series using groupby and pd.isnull

I have a sample dataframe which generically looks like this:

df = pd.Dataframe({'Class': [1, 2, 3, 2, 1, 2, 3, 2],
'Sex': [1, 0, 0, 0, 1, 1, 0, 1],
'Age': [15, 24, 13, 28, 29, NaN, 34, 27]})


Which displays as:

Age Class Sex
0 15.0 1 1
1 24.0 2 0
2 13.0 2 0
3 28.0 2 0
4 29.0 1 1
5 NaN 2 1
6 34.0 1 0
7 27.0 2 1


What I'd like to do is fill in each of the NaN values in the 'Age' series with the median value for all entries that have their 'Class' and 'Sex' grouping.

So for example, when I access these values like so:

df.groupby(['Class', 'Sex'])['Age'].median()


and get:

Class Sex
1 0 34.0
1 22.0
2 0 24.0
1 27.0


I'd like to write a function that automatically fills the extant NaN value with 27 since that is the median of the entries that have a Class value of 2 and a Sex value of 1.

Right now I have:

df['Age'] = df.groupby(['Class', 'Sex'])['Age'].apply(lambda x: x.median() if pd.isnull(x) else x)


and am getting the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().


However, a very similar syntax was used in the answers for this question and this one, so I'm not quite sure why mine doesn't work, particularly, the latter also uses the isnull method in its lambda function so it's not clear to me why mine doesn't work but that one does.

I've also tried using the fillna method like so:

df['Age'] = df['Age'].fillna(df.groupby(['Class', 'Sex'])['Age'].median())


But got the following error message:

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'


I'm open to other methods which get the same value, but prefer something that relies entirely on Pandas methods without having to use a separate for-loop and passing it into the 'Apply' method to make it as concise as possible.

Thank you.

Answer

One option would be to use transform to replace null values with median for the Age column:

df['Age'] = df.groupby(['Class', 'Sex']).Age.transform(lambda col: col.where(col.notnull(), col.median()))

df

#   Age Class   Sex
#0  15.0    1   1
#1  24.0    2   0
#2  13.0    3   0
#3  28.0    2   0
#4  29.0    1   1
#5  27.0    2   1
#6  34.0    3   0
#7  27.0    2   1

Alternatively use replace method instead of where also works:

df['Age'] = df.groupby(['Class', 'Sex']).Age.transform(lambda col: col.replace(np.nan, col.median()))