neversaint neversaint - 1 month ago 10
Python Question

Delete Pandas Dataframe columns if there are zeros in more than K rows

I have the following data frame:

import pandas as pd
df = pd.DataFrame({'id':['a','b','c','d','e'],
'A':[-14,-90,-90,-96,-91],
'B':[-103,0,-110,-114,-114],
'D':[0,0,0,0,0],
'C':[-101,0,-110,0,0]})


It looks like this:

A B C D id
0 -14 -103 -101 0 a
1 -90 0 0 0 b
2 -90 -110 -110 0 c
3 -96 -114 0 0 d
4 -91 -114 0 0 e


What I want to do is to perform operation to delete any column if there are 0 in more than 2 rows. How can I achieve that?

Finally will have data frame with this column: A, B, id.

Answer

You can use cumsum with any for mask and then a bit changed boolean indexing for selecting by columns:

mask = ((df == 0).cumsum() > 1).any()
print (mask)
A     False
B     False
C      True
id    False
dtype: bool

print (df.ix[:, ~mask])
    A    B id
0 -14 -103  a
1 -90    0  b
2 -90 -110  c
3 -96 -114  d
4 -91 -114  e

Explanation of masks:

print (df == 0)
       A      B      C     id
0  False  False  False  False
1  False   True   True  False
2  False  False  False  False
3  False  False   True  False
4  False  False   True  False

print ((df == 0).cumsum())
   A  B  C  id
0  0  0  0   0
1  0  1  1   0
2  0  1  1   0
3  0  1  2   0
4  0  1  3   0

print ((df == 0).cumsum() > 1)
       A      B      C     id
0  False  False  False  False
1  False  False  False  False
2  False  False  False  False
3  False  False  False  False
4  False  False   True  False

EDIt by comment - need all for mask:

mask = (df == 0).all()
print (mask)
A     False
B     False
C     False
D      True
id    False
dtype: bool

print (df.ix[:, ~mask])
    A    B    C id
0 -14 -103 -101  a
1 -90    0    0  b
2 -90 -110 -110  c
3 -96 -114    0  d
4 -91 -114    0  e