Jeff L. - 1 year ago 116

Python Question

I have several datasets where I need to discard observations if all the values in a subset of columns are equal to 0 or N (depending on the column). For example:

`df = pd.DataFrame({'one':[0,0,1,2,0], 'two':[0,0,0,0,0], 'three':['N','Y','N','Y','N']})`

drop_subset = ['one', 'three']

In [4]: df

Out[4]:

one three two

0 0 N 0

1 0 Y 0

2 1 N 0

3 2 Y 0

4 0 N 0

I need to look at just columns

`one`

`three`

`df[~(df[drop_subset] == 0).all(axis=1)]`

And it would work fine. But when I try to expand it out:

`df[~(df[drop_subset] == 0 or df[drop_subset] == 'N').all(axis=1)]`

I get the dreaded:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(),a.item(), a.any() or a.all().

I originally thought to treat each of the inner conditions using the

`any`

`all`

`In [71]: (df[drop_subset] == 0)`

Out[71]:

one three

0 True False

1 True False

2 False False

3 False False

4 True False

without simultaneously testing for the 'N' value also. And while it seems obvious on this small dataframe that I could test column

`one`

`three`

`drop_subset`

My last resort is using apply with a lambda, but it would be painfully slow due to the size of the data. Plus I feel like there must be a vectorized way to express this logic in Panadas, and that's what I'm looking for.

Answer Source

IIUC you can use DataFrame.isin() method:

```
In [68]: df[df[drop_subset].isin([0,'N']).all(1)]
Out[68]:
one three two
0 0 N 0
4 0 N 0
In [69]: df[~df[drop_subset].isin([0,'N']).all(1)]
Out[69]:
one three two
1 0 Y 0
2 1 N 0
3 2 Y 0
```