foebu foebu - 8 months ago
Python Question

PySpark Dataframes: how to filter on multiple conditions with compact code?

If I have a list of column names and I want to filter rows where the values of those columns are greater than zero, is there something similar to this which I can do?

columns = ['colA','colB','colC','colD','colE','colF']
new_df = df.filter(any([df[c]>0 for c in columns]))

This returns:

ValueError: Cannot convert column into bool: please use '&' for 'and',
'|' for 'or', '~' for 'not' when building DataFrame boolean
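The error happens because Python's built-in `any()` calls `bool()` on each element, and a Spark `Column` refuses truth-testing (a column is a deferred expression, not a concrete `True`/`False`). A minimal stand-in class (not Spark itself, just a sketch mimicking the behavior) reproduces the problem and shows why folding with `|` works instead:

```python
import operator
from functools import reduce

class Cond:
    """Stand-in mimicking a Spark Column: it refuses truth-testing
    but supports `|` to build up a combined expression."""
    def __init__(self, expr):
        self.expr = expr
    def __bool__(self):
        # This is what PySpark does when any()/and/or force a truth value
        raise ValueError("Cannot convert column into bool")
    def __or__(self, other):
        return Cond(f"({self.expr} | {other.expr})")

conds = [Cond("colA > 0"), Cond("colB > 0"), Cond("colC > 0")]

# any() truth-tests each element, so it raises -- just like PySpark:
try:
    any(conds)
except ValueError as e:
    print(e)  # Cannot convert column into bool

# reduce with operator.or_ only ever uses `|`, so it builds one expression:
combined = reduce(operator.or_, conds)
print(combined.expr)  # ((colA > 0 | colB > 0) | colC > 0)
```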

I guess I could sum those columns and then filter on that single column (since I don't have negative numbers). But if I had negative numbers, the sum trick wouldn't work. And in any case, if I had to filter those columns on a condition other than the sum, how could I do what I want?
Any idea?


You can use the `or_` operator from the `operator` module together with `functools.reduce` instead:

from operator import or_
from functools import reduce

newdf = df.where(reduce(or_, (df[c] > 0 for c in columns)))
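Note that `reduce` folds the predicates pairwise from the left, exactly as if you had written `(df[c1] > 0) | (df[c2] > 0) | ...` by hand. With plain Python booleans in place of Spark columns, the folding looks like this:

```python
from functools import reduce
from operator import or_

flags = [False, True, False]

# reduce(or_, flags) evaluates as (False | True) | False
result = reduce(or_, flags)
print(result)  # True
```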

EDIT: a more Pythonic solution:

from pyspark.sql.functions import lit

def any_(*preds):
    """OR together any number of Column predicates."""
    cond = lit(False)
    for pred in preds:
        cond = cond | pred
    return cond

newdf = df.where(any_(*[df[c] > 0 for c in columns]))
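The loop in `any_` is the same left fold, just seeded with an explicit `lit(False)` identity element, which also makes an empty predicate list safe. With an identity you could equivalently write `reduce(or_, preds, lit(False))`. The same pattern with plain booleans, using `False` as a stand-in for `lit(False)`:

```python
from functools import reduce
from operator import or_

def any_(*preds):
    # Fold with False as the identity, mirroring lit(False) in the Spark version
    return reduce(or_, preds, False)

print(any_(False, True, False))  # True
print(any_())                    # False (the identity handles the empty case)
```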