DGraham - 2 months ago 9

Python Question

I have the following dataframe:

`region pop_1 pop_1_source pop_2 pop_2_source pop_3 pop_3_source`

a 99 x1 84 x2 61 x3

b 64 x1 65 x2 16 x3

c 92 x1 26 x2 6 x3

d 82 x1 60 x2 38 x3

e 45 x1 77 x2 42 x3

I can calculate the highest value found in each region through:

`df['upper_limit'] = df[['pop_1','pop_2','pop_3']].max(axis=1)`

If I only compare two populations I can then calculate the source of the highest population i.e:

`df['upper_limit_source'] = np.where(df.upper_limit == df['upper_limit'],df.pop_1,df.pop_2)`

However if I try expand this to search into all three columns, it fails to work.

I have searched for a solution, but cannot make anything work with np.where(np.logical_or or similar.

Am I missing something obvious?

Answer

I found your question a bit confusing (among other things, `df.upper_limit == df['upper_limit']`

is always true, and your "source" columns are all filled with `x1`

(except for one `1x`

which looks like a typo)).

However, it seems like you'd like to find *which* of the three columns was responsible for the maximum, then calculate a value based on this. So, to calculate the column responsible, you could use `np.argmax`

:

```
import numpy as np
idmax = np.argmax(df[['pop_1','pop_2','pop_3']].as_matrix(), axis=1)
```

This will give you, for each row, 0, 1, or 2, depending on which of the three columns was responsible for the maximum.

Now if, for example, you'd like to choose `pop_1_source`

, `pop_2_source`

, or `pop_3_source`

, according to the index, you could use `np.choose`

:

```
np.choose(idmax, df[[`pop_1_source', 'pop_2_source', pop_3_source']].as_matrix().T)
```