user322778 - 3 months ago
Python Question

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone can help me. I'm new to Python, and I have a dataframe with 111 columns and over 40,000 rows. Every column contains NaN values (some more than others), and I want to drop the columns in which at least 80% of the values are NaN. How can I do this?

To solve my problem, I tried the following code:

df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8 == True, axis=0)


The expression x.isnull().sum()/len(x) divides the number of NaN values in the column x by the length of x, and the part < 0.8 == True is meant to select the columns containing less than 80% NaN.

The problem is that when I run this code, I only get the column names together with the boolean "True", but I want the entire columns, not just their names. What should I do?

Answer

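You could build a boolean mask with one entry per column (True where less than 80% of the values are NaN) and pass it to df.loc to select the matching columns: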

filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
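
To see the approach end to end, here is a minimal sketch on a small made-up dataframe. The column names and values are hypothetical, chosen only for illustration. It uses df.isnull().mean(), which gives the same per-column fraction as df.isnull().sum()/len(df).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 10 rows:
#   "a" has no NaN, "b" is 90% NaN, "c" is 100% NaN, "d" is 50% NaN.
df = pd.DataFrame({
    "a": np.arange(10, dtype=float),
    "b": [np.nan] * 9 + [1.0],
    "c": [np.nan] * 10,
    "d": [1.0, np.nan] * 5,
})

# Fraction of NaN values in each column (a Series indexed by column name).
nan_fraction = df.isnull().mean()

# Boolean mask: True for the columns to keep (less than 80% NaN).
filt = nan_fraction < 0.8

# Select the whole columns, not just their names.
df1 = df.loc[:, filt]

print(list(df1.columns))
```

In this sketch only the columns "a" and "d" remain. As a side note on the attempt in the question: the == True is not needed, and because Python chains comparisons, an expression of the form a < 0.8 == True is evaluated as a < 0.8 and 0.8 == True, which is not the intended test.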