durbachit durbachit - 2 months ago 104
Python Question

Pandas - dropping rows with missing data not working using .isnull(), notnull(), dropna()

This is really weird. I have tried several ways of dropping rows with missing data from a pandas dataframe, but none of them seem to work.
This is the code (I just uncomment one of the methods used - but these are the three that I used in different modifications - this is the latest):

import pandas as pd
Test = pd.DataFrame({'A':[1,2,3,4,5],'B':[1,2,'NaN',4,5],'C':[1,2,3,'NaT',5]})
print(Test)
#Test = Test.ix[Test.C.notnull()]
#Test = Test.dropna()
Test = Test[~Test[Test.columns.values].isnull()]
print "And now"
print(Test)


But in all cases, all I get is this:

A B C
0 1 1 1
1 2 2 2
2 3 NaN 3
3 4 4 NaT
4 5 5 5
And now
A B C
0 1 1 1
1 2 2 2
2 3 NaN 3
3 4 4 NaT
4 5 5 5


Is there any mistake that I am making? or what is the problem? Ideally, I would like to get this:

A B C
0 1 1 1
1 2 2 2
4 5 5 5

Answer

Your example DF has NaN and NaT as strings which .dropna, .notnull and co. won't consider falsey, so given your example you can use...

df[~df.isin(['NaN', 'NaT']).any(axis=1)]

Which gives you:

   A  B  C
0  1  1  1
1  2  2  2
4  5  5  5

If you had a DF such as (note of the use of np.nan and np.datetime64('NaT') instead of strings:

df = pd.DataFrame({'A':[1,2,3,4,5],'B':[1,2,np.nan,4,5],'C':[1,2,3,np.datetime64('NaT'),5]})

Then running df.dropna() which give you:

   A    B  C
0  1  1.0  1
1  2  2.0  2
4  5  5.0  5

Note that column B is now a float instead of an integer as that's required to store NaN values.

Comments