Gabdu Gunnu - 5 months ago
Python Question

Filling NaN data with mode() doesn't work - Pandas

I have a data set with a series called Outlet_Size, which contains one of {'Medium', nan, 'High', 'Small'}. Around 2566 records are missing, so I thought I would fill them with the mode() value and wrote something like this:

train['Outlet_Size']=train['Outlet_Size'].fillna(train['Outlet_Size'].dropna().mode())


But when I tried to count the missing records with the command

sum(train['Outlet_Size'].isnull())


it still shows 2566 NaN records. Why is that?

Thank you for answers

Answer

The problem is that mode returns a Series, and this causes fillna to leave the NaN values in place. Consider a simple example:

In [194]:    
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':['low','low',np.nan,'medium','medium','medium','medium']})
df

Out[194]:
        a
0     low
1     low
2     NaN
3  medium
4  medium
5  medium
6  medium

In [195]:    
df['a'].fillna(df['a'].mode())

Out[195]:
0       low
1       low
2       NaN
3    medium
4    medium
5    medium
6    medium
Name: a, dtype: object

So fillna did not fill the NaN above. Look at what mode returns:

In [196]:    
df['a'].mode()

Out[196]:
0    medium
dtype: object

It's a Series, albeit with a single row. fillna aligns a Series argument on the index, so it can only fill a NaN at index 0, and the NaN at index 2 is left untouched. What you want is the scalar value, which you get by indexing into the Series:

In [197]:    
df['a'].fillna(df['a'].mode()[0])

Out[197]:
0       low
1       low
2    medium
3    medium
4    medium
5    medium
6    medium
Name: a, dtype: object
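
Applied to the column from the question, the same fix might look like the sketch below. The train frame here is a small made-up stand-in, since the real data set isn't shown in the question; only the column name Outlet_Size comes from it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data; only the column name
# 'Outlet_Size' is taken from the question.
train = pd.DataFrame(
    {"Outlet_Size": ["Medium", np.nan, "High", "Small", "Medium", np.nan]}
)

# mode() returns a Series; take its first element to get a scalar fill value.
fill_value = train["Outlet_Size"].mode()[0]

# A scalar is broadcast to every NaN, regardless of index.
train["Outlet_Size"] = train["Outlet_Size"].fillna(fill_value)

# Count the missing values that remain after filling.
missing = train["Outlet_Size"].isnull().sum()
print(fill_value, missing)
```

Checking isnull().sum() afterwards, as in the question, is a quick way to confirm that the fill took effect.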

EDIT

Regarding whether dropna is required: no, it isn't.

In [204]:
df = pd.DataFrame({'a':['low','low',np.nan,'medium','medium','medium','medium',np.nan,np.nan,np.nan,np.nan]})
df['a'].mode()

Out[204]:
0    medium
dtype: object

You can see that NaN is ignored by mode, even though it is the most frequent entry here, so the extra dropna() call is redundant.
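
One likely reason mode returns a Series rather than a scalar is that several values can tie for most frequent. Here is a minimal sketch with made-up data, assuming a pandas version where Series.mode accepts the dropna keyword:

```python
import numpy as np
import pandas as pd

s = pd.Series(["low", "low", "medium", "medium", np.nan])

# Two values tie for most frequent, so mode() returns both of them.
modes = s.mode()
print(modes.tolist())

# Indexing with [0] therefore picks just the first of the tied values.
first_mode = modes[0]

# With dropna=False, NaN is counted like any other value.
t = pd.Series(["low", np.nan, np.nan, np.nan])
mode_with_nan = t.mode(dropna=False)
print(mode_with_nan.isnull().tolist())
```

If a tie matters for your data, inspect the whole Series returned by mode before settling on [0].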