John - 1 year ago 200

R Question

In R, we can use

`sapply`

`X[, sapply(X, function(x) length(unique(x))) >=10]`

How can we do the same thing in Python Pandas?

Also, how can we choose columns with missing proportion less than 10% like what we can do in R:

`X[, sapply(X, function(x) sum(is.na(x))/length(x) ) < 0.1]`

Thanks.

Recommended for you: Get network issues from **WhatsUp Gold**. **Not end users.**

Answer Source

You can use `nunique`

with `apply`

, because it works only with `Series`

:

```
print (df.ix[:, df.apply(lambda x: x.nunique()) >= 10])
```

```
print (df.ix[:, df.isnull().mean() < 0.1])
```

Sample:

```
df = pd.DataFrame({'A':[1,np.nan,3],
'B':[4,4,np.nan],
'C':[7,8,9],
'D':[3,3,5]})
print (df)
A B C D
0 1.0 4.0 7 3
1 NaN 4.0 8 3
2 3.0 NaN 9 5
```

```
print (df.ix[:, df.apply(lambda x: x.nunique()) >= 2])
A C D
0 1.0 7 3
1 NaN 8 3
2 3.0 9 5
print (df.isnull().sum())
A 1
B 1
C 0
D 0
dtype: int64
print (df.isnull().sum() / len(df.index))
A 0.333333
B 0.333333
C 0.000000
D 0.000000
dtype: float64
print (df.isnull().mean())
A 0.333333
B 0.333333
C 0.000000
D 0.000000
dtype: float64
print (df.ix[:, df.isnull().sum() / len(df.index) < 0.1])
C D
0 7 3
1 8 3
2 9 5
```

Or:

```
print (df.ix[:, df.isnull().mean() < 0.1])
C D
0 7 3
1 8 3
2 9 5
```

Recommended from our users: **Dynamic Network Monitoring from WhatsUp Gold from IPSwitch**. ** Free Download**