Eric D. Brown Eric D. Brown -3 years ago 87
Python Question

Removing 'overlapping' dates from pandas dataframe

I have a pandas dataframe that looks like the following:

ID date close
1 09/15/07 123.45
2 06/01/08 130.13
3 10/25/08 132.01
4 05/13/09 118.34
5 11/07/09 145.99
6 11/15/09 146.73
7 07/03/11 171.10


I want to remove any rows that overlap.

Overlapping rows is defined as any row within X days of another row. For example, if X = 365. then the result should be:

ID date close
1 09/15/07 123.45
3 10/25/08 132.01
5 11/07/09 145.99
7 07/03/11 171.10


If X = 50, the result should be:

ID date close
1 09/15/07 123.45
2 06/01/08 130.13
3 10/25/08 132.01
4 05/13/09 118.34
5 11/07/09 145.99
7 07/03/11 171.10


I've taken a look at a few questions on here but haven't find the right approach. For example, Pandas check for overlapping dates in multiple rows and Fastest way to eliminate specific dates from pandas dataframe are similar but don't quite get me what I need.

I have the following ugly code in place today that works for small X values but when X gets larger (e.g., when X = 365), it removes all dates except the original date.

filter_dates = []
for index, row in df.iterrows():
if observation_time == 'D':
for i in range(1, observation_period):
filter_dates.append((index.date() + timedelta(days=i)))
df = df[~df.index.isin(filter_dates)]


Any help/pointers would be appreciated!

Clarification:

The solution to this needs to look at every row, not just the first row.

Answer Source

I just used an elementary approach (essentially it's a tweaked version of the OP's approach), no fancy numpy or pandas ops but linear instead of quadratic complexity (when compard to the distance matrix approach).
However (as Cory Madden), I assume that the data is sorted with respect to the date column. I hope it's correct:

Dataframe -> I'm using pandas index here:

import pandas as pd
df = pd.DataFrame({'date': ["2007-09-15","2008-06-01","2008-10-25",
                            "2009-05-13","2009-11-07", "2009-11-15", "2011-07-03"],
                   'close':[123.45, 130.13, 132.01, 118.34, 
                            145.99, 146.73, 171.10]})
df["date"]=pd.to_datetime(df["date"])

The following block of code can easily be wrappen in a function and computs the correct dataframe indexes for X=365:

X = 365
filter_ids = [0]
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
     if (row["date"] - last_day).days > X:
         filter_ids.append(index)
         last_day = row["date"]

and the result:

print(df.loc[filter_ids,:])
    close       date
0  123.45 2007-09-15
2  132.01 2008-10-25
4  145.99 2009-11-07
6  171.10 2011-07-03

note that the indexes are shifted by one due to the index starting from zero.


I just wanted to comment on linear versus quadtratic complexity My solution has linear time complexity, seeing every row of the data frame exactly once. Cory maddens solution has quadratic complexity: in each iteration every row of the data frame is accessed. However, if X (the day difference) is large, we might discard a huge part of the data set end only perform very few iterations.

To this end, one might want to consider the following worst case scenario for X=2 the data set:

df = pd.DataFrame({'date':pd.date_range(start='01.01.1900', end='01.01.2100', freq='D')})

On my machine the following codes yield:

%%timeit
X = 2
filter_ids = [0]
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if (row["date"] -last_day).days > X:
        filter_ids.append(index)
        last_day = row["date"]
1 loop, best of 3: 7.06 s per loop

and

day_diffs = abs(df.iloc[0].date - df.date).dt.days
i = 0
days = 2
idx = day_diffs.index[i]
good_ids = {idx}
while True:
    try:
        current_row = day_diffs[idx] 
        day_diffs = day_diffs.iloc[1:]
        records_not_overlapping = (day_diffs - current_row) > days         
        idx = records_not_overlapping[records_not_overlapping == True].index[0] 
        good_ids.add(idx)
except IndexError:  
    break
1 loop, best of 3: 3min 16s per loop
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download