max max - 4 months ago 130
Python Question

pandas equivalent of np.where

np.where has the semantics of a vectorized if/else (similar to Apache Spark's when/otherwise DataFrame method). I know that I can use np.where on pandas Series, but pandas often defines its own API to use instead of raw numpy functions, which is usually more convenient with pd.Series/pd.DataFrame.

Sure enough, I found pandas.DataFrame.where. However, at first glance, it has completely different semantics. I could not find a way to rewrite the most basic example of np.where using pandas where:

# df is pd.DataFrame
# how to write this using df.where?
df['C'] = np.where((df['A']<0) | (df['B']>0), df['A']+df['B'], df['A']/df['B'])


Am I missing something obvious? Or is pandas where intended for a completely different use case, despite having the same name as np.where?

Answer

Try:

(df['A'] + df['B']).where((df['A'] < 0) | (df['B'] > 0), df['A'] / df['B'])
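To check that this really matches the np.where line from the question, here is a minimal sketch on a small made-up DataFrame (the sample values are my own assumption, chosen so that B never contains zero):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; zeros in B are avoided so the division stays finite.
df = pd.DataFrame({"A": [-1.0, 2.0, 3.0, -4.0], "B": [2.0, -4.0, 5.0, -8.0]})

cond = (df["A"] < 0) | (df["B"] > 0)

# numpy version from the question: returns a plain ndarray
expected = np.where(cond, df["A"] + df["B"], df["A"] / df["B"])

# pandas version: keep A + B where cond is True, otherwise take A / B
df["C"] = (df["A"] + df["B"]).where(cond, df["A"] / df["B"])

print(df)
print(np.array_equal(expected, df["C"].to_numpy()))
```

Only the second row fails the condition here, so it is the only one that takes the A / B value.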

The difference between numpy's where and DataFrame.where is that the values used where the condition is True are supplied by the DataFrame that the where method is called on (docs).

I.e.

np.where(m, A, B)

is roughly equivalent to

A.where(m, B)
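The "roughly" matters in two ways that I am aware of: the pandas version returns a Series that keeps the index, while np.where returns a plain array, and if you leave out the second argument pandas fills the False positions with NaN. A small sketch with made-up Series (the names A, B, m follow the lines above; the data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-default index to make the difference visible.
A = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
B = pd.Series([10.0, 20.0, 30.0], index=["x", "y", "z"])
m = A > 1.5

via_numpy = np.where(m, A, B)   # plain ndarray, the index is dropped
via_pandas = A.where(m, B)      # Series that keeps A's index

# With `other` omitted, pandas puts NaN where the condition is False.
masked = A.where(m)

print(type(via_numpy).__name__, type(via_pandas).__name__)
print(via_pandas)
print(masked)
```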

If you wanted a similar call signature using pandas, you could take advantage of the way method calls work in Python:

pd.Series.where(cond=(df['A'] < 0) | (df['B'] > 0), self=df['A'] + df['B'], other=df['A'] / df['B'])

or without kwargs (note that the positional order of the arguments differs from the np.where argument order):

pd.Series.where(df['A'] + df['B'], (df['A'] < 0) | (df['B'] > 0), df['A'] / df['B'])
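If you find yourself wanting numpy's argument order often, a thin wrapper may read better than calling the unbound method. This is just a sketch: np_style_where is a hypothetical name of my own, not part of pandas.

```python
import pandas as pd


def np_style_where(cond, a, b):
    """Hypothetical helper: numpy's (cond, a, b) order on top of pandas where."""
    # `a` supplies the values where cond is True, `b` where it is False.
    return a.where(cond, b)


# Made-up example data
a = pd.Series([1, 2, 3])
b = pd.Series([-1, -2, -3])
result = np_style_where(a > 1, a, b)
print(result)
```

Unlike np.where, this requires a to be a pandas object (Series or DataFrame), since it relies on its where method.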