John - 1 year ago 200
R Question

# Python Pandas: select column with the number of unique values greater than 10

In R, we can use

`sapply`
to extract columns with the number of unique values greater than 10 by:

``````X[,  sapply(X,  function(x) length(unique(x)))  >=10]
``````

How can we do the same thing in Python Pandas?

Also, how can we choose columns with missing proportion less than 10% like what we can do in R:

``````X[,  sapply(X,  function(x) sum(is.na(x))/length(x)  )  < 0.1]
``````

Thanks.

You can use `nunique` with `apply`, because it works only with `Series`:

``````print (df.ix[:, df.apply(lambda x: x.nunique()) >= 10])
``````

and second `isnull` with `mean`:

``````print (df.ix[:, df.isnull().mean() <  0.1])
``````

Sample:

``````df = pd.DataFrame({'A':[1,np.nan,3],
'B':[4,4,np.nan],
'C':[7,8,9],
'D':[3,3,5]})

print (df)
A    B  C  D
0  1.0  4.0  7  3
1  NaN  4.0  8  3
2  3.0  NaN  9  5
``````
``````print (df.ix[:, df.apply(lambda x: x.nunique()) >= 2])
A  C  D
0  1.0  7  3
1  NaN  8  3
2  3.0  9  5

print (df.isnull().sum())
A    1
B    1
C    0
D    0
dtype: int64

print (df.isnull().sum() / len(df.index))

A    0.333333
B    0.333333
C    0.000000
D    0.000000
dtype: float64

print (df.isnull().mean())
A    0.333333
B    0.333333
C    0.000000
D    0.000000
dtype: float64

print (df.ix[:, df.isnull().sum() / len(df.index) <  0.1])
C  D
0  7  3
1  8  3
2  9  5
``````

Or:

``````print (df.ix[:, df.isnull().mean() <  0.1])
C  D
0  7  3
1  8  3
2  9  5
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download