DGraham DGraham - 3 months ago 16
Python Question

multiple conditions for np.where python pandas

I have the following dataframe:

region pop_1 pop_1_source pop_2 pop_2_source pop_3 pop_3_source
a 99 x1 84 x2 61 x3
b 64 x1 65 x2 16 x3
c 92 x1 26 x2 6 x3
d 82 x1 60 x2 38 x3
e 45 x1 77 x2 42 x3

I can calculate the highest value found in each region through:

df['upper_limit'] = df[['pop_1','pop_2','pop_3']].max(axis=1)

If I only compare two populations I can then calculate the source of the highest population i.e:

df['upper_limit_source'] = np.where(df.upper_limit == df['upper_limit'],df.pop_1,df.pop_2)

However if I try expand this to search into all three columns, it fails to work.
I have searched for a solution, but cannot make anything work with np.where(np.logical_or or similar.

Am I missing something obvious?


I found your question a bit confusing (among other things, df.upper_limit == df['upper_limit'] is always true, and your "source" columns are all filled with x1 (except for one 1x which looks like a typo)).

However, it seems like you'd like to find which of the three columns was responsible for the maximum, then calculate a value based on this. So, to calculate the column responsible, you could use np.argmax:

import numpy as np

idmax = np.argmax(df[['pop_1','pop_2','pop_3']].as_matrix(), axis=1)

This will give you, for each row, 0, 1, or 2, depending on which of the three columns was responsible for the maximum.

Now if, for example, you'd like to choose pop_1_source, pop_2_source, or pop_3_source, according to the index, you could use np.choose:

np.choose(idmax, df[[`pop_1_source', 'pop_2_source', pop_3_source']].as_matrix().T)