Godders Godders -4 years ago 123
Python Question

Match 2 dataframes within a specified tolerance range of DateTimeIndex while keeping a unique identifier

I have two DataFrames:

df1
:

Time (index) Value
2016-10-28 10:27:00+00:00 5
2016-10-28 10:28:00+00:00 12
2016-10-28 10:29:00+00:00 20
2016-10-28 10:30:00+00:00 34
2016-10-28 10:31:00+00:00 1
2016-10-28 10:32:00+00:00 30
2016-10-28 10:33:00+00:00 20
2016-10-28 10:34:00+00:00 10


df2
:

Time (index) Value
2016-10-28 10:27:00+00:00 56
2016-10-28 10:31:00+00:00 72


How would I find the rows in
df1
that have a time within 1mins inclusive of a row in
df2
? So that my result is:

df3
:

Time (index) Value
2016-10-28 10:27:00+00:00 5
2016-10-28 10:28:00+00:00 12
2016-10-28 10:30:00+00:00 34
2016-10-28 10:31:00+00:00 1
2016-10-28 10:32:00+00:00 30


I've looked at all sorts of merging, joining etc. but it's all crazy stuff to a web developer!

Any help greatly appreciated.

EDIT:

How might an answer be extended to include an arbitrary identifier as follows:

df3
:

Time (index) Value Id
2016-10-28 10:27:00+00:00 5 1
2016-10-28 10:28:00+00:00 12 1
2016-10-28 10:30:00+00:00 34 2
2016-10-28 10:31:00+00:00 1 2
2016-10-28 10:32:00+00:00 30 2

Answer Source

You could use reindex to transform the index of df2 to that of df1 with the allowed tolerance limit of 1 minute by passing method='nearest' which has support for the entered tolerance range.

Drop the missing values (NaN's) from these and realign df1 based on the obtained indices later.

matches = df2.reindex(df1.index, method='nearest', tolerance=pd.Timedelta('1m'))
finite_matches = matches.dropna()
df3 = df1.loc[finite_matches.index]

enter image description here

Note:

Starting with v0.20.0, pd.merge_asof (currently it does only backward lookup) could also be used to achieve the same thing in the following manner:

pd.merge_asof(df1, df2, left_index=True, right_index=True, 
              tolerance=pd.Timedelta('1m'), direction='nearest')

Edit:

To include identifiers corresponding to similar valued groupings, you can take the difference between consecutive cells obtained from the matches dataframe after dropping nulls from it.

Wherever these do not equate to zero, their cumulative sum gets computed causing breaks (transition in outcome) at any False value encountered while traversing the boolean series.

df3.assign(Id=finite_matches.diff().ne(0).cumsum())

enter image description here

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download