Marco Marco - 17 days ago 5
Python Question

Fastest way to find which of two lists of columns of each row is true in a pandas dataframe

I'm looking for the fastest way to do the following:

We have a pd.DataFrame:

df = pd.DataFrame({
'High': [1.3,1.2,1.1],
'Low': [1.3,1.2,1.1],
'High1': [1.1, 1.1, 1.1],
'High2': [1.2, 1.2, 1.2],
'High3': [1.3, 1.3, 1.3],
'Low1': [1.3, 1.3, 1.3],
'Low2': [1.2, 1.2, 1.2],
'Low3': [1.1, 1.1, 1.1]})


That looks like:

In [4]: df
Out[4]:
High High1 High2 High3 Low Low1 Low2 Low3
0 1.3 1.1 1.2 1.3 1.3 1.3 1.2 1.1
1 1.2 1.1 1.2 1.3 1.2 1.3 1.2 1.1
2 1.1 1.1 1.2 1.3 1.1 1.3 1.2 1.1


What I want to know is which one of the High1, High2, High3 float values is the first that is larger or equal to the High value. If there is none, it should be np.nan

And the same for the Low1, Low2, Low3 value, but in this case which one of them is the first that is lower or equal to the High value. If there is none, it should be np.nan

At the end I need to know which one, Low or High has come first.

One way to solve this is in a weird and not too performant way is:

df['LowIs'] = np.nan
df['HighIs'] = np.nan

for i in range(1,4):
df['LowIs'] = np.where((np.isnan(df['LowIs'])) & (
df['Low'] >= df['Low'+str(i)]), i, df['LowIs'])
df['HighIs'] = np.where((np.isnan(df['HighIs'])) & (
df['High'] <= df['High'+str(i)]), i, df['HighIs'])

df['IsFirst'] = np.where(
df.LowIs < df.HighIs,
'Low',
np.where(df.LowIs > df.HighIs, 'High', 'None')
)


Which gives me:

In [8]: df
Out[8]:
High High1 High2 High3 Low Low1 Low2 Low3 LowIs HighIs IsFirst
0 1.3 1.1 1.2 1.3 1.3 1.3 1.2 1.1 1.0 3.0 Low
1 1.2 1.1 1.2 1.3 1.2 1.3 1.2 1.1 2.0 2.0 None
2 1.1 1.1 1.2 1.3 1.1 1.3 1.2 1.1 3.0 1.0 High


As I have to do this over and over again in many iterations where High/Low will be different, performance when doing this is key.

So I wouldn't mind if the High1, High2, High3 and Low1, Low2, Low3 would be in a separate DataFrame that is transposed or if it would be in a dict or whatever. So the process to prepare the data in whatever gives the best possible performance can be slow and awkward.

One solution I worked on but just couldn't get finished to work in a vectorized way and that also seems quite slow is:

df.loc[(df.index == 0), 'HighIs'] = np.where(
df.loc[(df.index == 0), ['High1', 'High2', 'High3']] >= 1.3
)[1][0] + 1


So checking for which one of the columns it is true in that first row and then looking at the index number of np.where().

Looking forward to any suggestions and hope to learn something new! :)

Answer

Here's a vectorized approach with NumPy broadcasting -

a = df.values
out1 = (a[:,1:4] >= a[:,0,None]).argmax(1)+1
out2 = (a[:,5:8] <= a[:,4,None]).argmax(1)+1
df['LowIs'] = out2
df['HighIs'] = out1
df['IsFirst'] = np.where(out1!=out2,np.where(out1 > out2, 'Low', 'High'),None)

Sample output -

In [195]: df
Out[195]: 
   High  High1  High2  High3  Low  Low1  Low2  Low3  LowIs  HighIs IsFirst
0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1      1       3     Low
1   1.2    1.1    1.2    1.3  1.2   1.3   1.2   1.1      2       2    None
2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1      3       1    High