Chris vCB Chris vCB - 1 month ago 7
Python Question

Split a pandas dataframe by a list of values from another data frame

I'm pretty sure there's a really simple solution for this and I'm just not realising it. However...

I have a data frame of high-frequency data. Call this data frame A. I also have a separate list of far lower frequency demarcation points, call this B. I would like to append a column to A that would display 1 if A's timestamp column is between B[0] and B[1], 2 if it is between B[1] and B[2], and so on.

As said, it's probably incredibly trivial, and I'm just not realising it at this late an hour.

Answer

Use searchsorted:

A['group'] = B['timestamp'].searchsorted(A['timestamp'])

For each value in A['timestamp'], an index value is returned. That index indicates where amongst the sorted values in B['timestamp'] that value from A would be inserted into B in order to maintain sorted order.

For example,

import numpy as np
import pandas as pd
np.random.seed(2016)

N = 10
A = pd.DataFrame({'timestamp':np.random.uniform(0, 1, size=N).cumsum()})
B = pd.DataFrame({'timestamp':np.random.uniform(0, 3, size=N).cumsum()})
#    timestamp
# 0   1.739869
# 1   2.467790
# 2   2.863659
# 3   3.295505
# 4   5.106419
# 5   6.872791
# 6   7.080834
# 7   9.909320
# 8  11.027117
# 9  12.383085

A['group'] = B['timestamp'].searchsorted(A['timestamp'])
print(A)

yields

   timestamp  group
0   0.896705      0
1   1.626945      0
2   2.410220      1
3   3.151872      3
4   3.613962      4
5   4.256528      4
6   4.481392      4
7   5.189938      5
8   5.937064      5
9   6.562172      5

Thus, the timestamp 0.896705 is in group 0 because it comes before B['timestamp'][0] (i.e. 1.739869). The timestamp 2.410220 is in group 1 because it is larger than B['timestamp'][0] (i.e. 1.739869) but smaller than B['timestamp'][1] (i.e. 2.467790).


You should also decide what to do if a value in A['timestamp'] is exactly equal to one of the cutoff values in B['timestamp']. Use

B['timestamp'].searchsorted(A['timestamp'], side='left')

if you want searchsorted to return i when B['timestamp'][i] <= A['timestamp'][i] <= B['timestamp'][i+1]. Use

B['timestamp'].searchsorted(A['timestamp'], side='right')

if you want searchsorted to return i+1 in that situation. If you don't specify side, then side='left' is used by default.