luisfer - 5 months ago
Python Question

How to use a user function to fillna() in pandas

This is a fragment of the dataframe I have:

Title | Age
Mr. | 30
Mr. | NaN
Mr. | 32
Mrs. | 28
Mrs. | 16
Mr. | 34
Mrs. | NaN

Edit: I added the last row, to clarify the question

I want to impute the NaNs (second and last rows). The second row should use the mean of the other "Mr." rows in the dataframe, which in this case is 32; the last row should use the mean of the other "Mrs." rows, which is 22.

Calculating the mean is as easy as:

value = df.loc[df["Title"] == "Mr.", "Age"].mean()

So I wrote a function called agefun:

def agefun(df, t):
    return df.loc[df["Title"] == t, "Age"].mean()

It works. Now, how can I use this function with fillna()? I'd like something like:

df['Age'].fillna(agefun(df, this_row_title))

But of course that doesn't work: I don't know how to tell the function that I want the value corresponding to the Title in that specific row.

How can this be performed?


transform returns a result with the same shape as the original series in the dataframe, so it can be assigned straight back to the column.
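To illustrate what "same shape" means here, a minimal sketch using the data from the question (the variable names group_means and row_means are mine, not from the original post). It compares a plain group mean with the transform version:

```python
import numpy as np
import pandas as pd

# Data copied from the question, including the row added in the edit.
df = pd.DataFrame({
    "Title": ["Mr.", "Mr.", "Mr.", "Mrs.", "Mrs.", "Mr.", "Mrs."],
    "Age": [30, np.nan, 32, 28, 16, 34, np.nan],
})

# Aggregating gives one value per group, indexed by Title.
group_means = df.groupby("Title")["Age"].mean()

# transform broadcasts each group's mean back onto every row of that group,
# so the result has the same index and length as df["Age"].
row_means = df.groupby("Title")["Age"].transform("mean")

print(group_means)
print(row_means)
```

The aggregated result has two entries (one per title), while the transformed one has seven, aligned row by row with the dataframe.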

df['Age'] = df.groupby('Title')['Age'].transform(lambda group: group.fillna(group.mean()))

>>> df
  Title  Age
0   Mr.   30
1   Mr.   32  # (30 + 32 + 34) / 3 = 32
2   Mr.   32
3  Mrs.   28
4  Mrs.   16
5   Mr.   34

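In the example above, all values are left unchanged except the NaN in the second row, which is filled with the mean of its group, i.e. the mean Age of all rows where the Title is "Mr.". A missing "Mrs." age, such as the last row added to the question, is filled the same way with the mean of the "Mrs." rows: (28 + 16) / 2 = 22.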
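For anyone who wants to run this end to end, here is a self-contained sketch on the seven rows from the question. It also shows an alternative that is closer to the fillna() call the question was reaching for. That alternative is my own suggestion rather than part of the approach above: it assumes fillna is given a Series, which pandas aligns on the index. The names filled, title_means and filled_alt are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Data copied from the question, including the row added in the edit.
df = pd.DataFrame({
    "Title": ["Mr.", "Mr.", "Mr.", "Mrs.", "Mrs.", "Mr.", "Mrs."],
    "Age": [30, np.nan, 32, 28, 16, 34, np.nan],
})

# Group-wise fill: within each Title, replace NaN with that group's mean.
filled = df.groupby("Title")["Age"].transform(lambda s: s.fillna(s.mean()))

# Alternative: build a per-row Series of group means and hand it to fillna.
# fillna aligns on the index, so each NaN gets the mean for its own Title.
title_means = df.groupby("Title")["Age"].mean()
filled_alt = df["Age"].fillna(df["Title"].map(title_means))

df["Age"] = filled
print(df)
```

Both variants should give the same column. The map version mostly makes the lookup "mean for this row's Title" explicit, which is what agefun(df, this_row_title) was trying to express.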