ShanZhengYang ShanZhengYang - 2 months ago 8
Python Question

How to match multiple columns in pandas DataFrame for an "interval"?

I have the following pandas DataFrame:

import pandas as pd
df = pd.DataFrame('filename.csv')
print(df)

order start end value
1 1342 1357 category1
1 1459 1489 category7
1 1572 1601 category23
1 1587 1599 category2
1 1591 1639 category1
....
15 792 813 category13
15 892 913 category5
....


So, there is an
order
column encompasses many rows each, and then a range/interval from
start
to
end
for each row. Each row then is labeled by a certain
value
(e.g. category1, category2, etc.)

Now I have another dataframe called
key_df
. It is basically the exact same format:

import pandas as pd
key_df = pd.DataFrame(...)
print(key_df)

order start end value
1 1284 1299 category4
1 1297 1309 category9
1 1312 1369 category3
1 1345 1392 category29
1 1371 1383 category31
....
1 1471 1501 category31
...


My goal is to take the
key_df
dataframe and check whether the intervals
start:end
match any of the rows in the original dataframe
df
. If it does, this row in
df
should be labeled with the
key_df
dataframe's
value
value.

In our example above, the dataframe
df
would end up like this:

order start end value key_value
1 1342 1357 category1 category29
1 1459 1489 category7 category31
....


This is because if you look at
key_df
, the row

1 1345 1392 category29


with interval
1::1345-1392
falls in the interval
1::1342-1357
in the original
df
. Likewise, the
key_df
row:

1 1471 1501 category31


corresponds to the second row in
df
:

1 1459 1489 category7 category31


I'm not entirely sure

(1) how to accomplish this task in pandas

(2) how to scale this efficiently in pandas

One could begin with an if statement, e.g.

if df.order == key_df.order:
# now check intervals...somehow


but this doesn't take advantage of the dataframe structure. One then must check by interval, i.e. something like
(df.start =< key_df.start) && (df.end => key_df.end)


I'm stuck. What is the most efficient way to match multiple columns in an "interval" in pandas? (Creating a new column if this condition is met is then straightforward)

Answer

You can use merge with boolean indexing, but if DataFrames are large, scaling is problematic:

df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('','_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
print (df1)
    order  start   end      value  start_key  end_key   value_key
3       1   1342  1357  category1     1345.0   1392.0  category29
4       1   1342  1357  category1     1371.0   1383.0  category31
5       1   1342  1357  category1     1471.0   1501.0  category31
11      1   1459  1489  category7     1471.0   1501.0  category31

EDIT by comment:

df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('','_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
df1 = pd.merge(df, df1, on=['order','start','end', 'value'], how='left')
print (df1)
   order  start   end       value  start_key  end_key   value_key
0      1   1342  1357   category1     1345.0   1392.0  category29
1      1   1342  1357   category1     1371.0   1383.0  category31
2      1   1342  1357   category1     1471.0   1501.0  category31
3      1   1459  1489   category7     1471.0   1501.0  category31
4      1   1572  1601  category23        NaN      NaN         NaN
5      1   1587  1599   category2        NaN      NaN         NaN
6      1   1591  1639   category1        NaN      NaN         NaN
7     15    792   813  category13        NaN      NaN         NaN
8     15    892   913   category5        NaN      NaN         NaN
Comments