Hugo Honorem Hugo Honorem - 5 months ago 26
Python Question

search and find between data frames

I have this data frame looking something like this: (not necessary these dates, length or order)

date1 date2 dummy
2015-10-01 2015-09-02 1
2015-10-01 2015-09-02 1
2015-10-03 2015-09-02 0
2015-10-04 2015-09-05 0
.......... .......... .
.......... .......... .
.......... .......... .
2015-10-20 2015-11-04 1
2015-10-20 2015-11-05 1


I'm creating a new data frame containing the earliest date in 'date2' and the latest date in 'date1' and fill the period between with dates.

startdate = df['date2'].min(axis=0)
enddate = df['date1'].max(axis=0)

def perdelta(start, end, delta):
curr = start
while curr <= end:
yield curr
curr += delta

data2 =[]
for result in perdelta(startdate, enddate, timedelta(days=1)):
data2.append(result)


I would like to find every row of dates in the new data frame, match it up with 'date1' and count how many of the same dates that has a zero in 'dummy'.
I can find all the zeros and count them for specific dates with the pandas groupby

g = df.groupby(['date1'])
df3 = pd.DataFrame(g.apply(lambda x: x[x['dummy'] == 0]['dummy'].count()), columns=['all_zeros'])


But this will only find the dates in 'date1' and count the zeros, and not start from my startdate, also it will skip dates that has a one and not paste a zero (counting non zeros should paste 0).

the output I would like to get is:

date_newdf count
'startdate' 0 (cuz it does not exist in date1)
2015-09-05 0 (cuz it does not exist in date1)
.......... .
.......... .
.......... .
2015-10-01 3 (found 3 zeroz with the this date)
.......... .
'enddate' 2


etc.

to replicate:

data = {'date1': ['15-10-01', '15-10-01', '15-10-03', '15-10-04', '15-10-05', '15-10-05'],
'date2': ['15-09-02', '15-09-02', '15-09-02', '15-09-05', '15-09-05', '15-09-05'],
'dummy': [1,1,0,0,0,1]}
df = pd.DataFrame(data, columns=['date1', 'date2' , 'dummy'])

Answer

I think, your need add reindex function with list data2 to the end of your script and then fill missing data NaN to 1.

Input for better testing:

       date1      date2  dummy
0 2015-10-01 2015-09-02      1
1 2015-10-01 2015-09-02      1
2 2015-10-03 2015-09-02      0
3 2015-10-04 2015-09-05      0
4 2015-10-05 2015-11-05      0
5 2015-10-05 2015-11-05      0
6 2015-10-05 2015-11-05      0
7 2015-10-05 2015-11-05      1
8 2015-10-05 2015-11-05      1
print df3
            all_zeros
date1                
2015-10-01          0
2015-10-03          1
2015-10-04          1
2015-10-05          3

df3 = df3.reindex(pd.DatetimeIndex(data2))
df3 = df3.fillna(0)
print df3
            all_zeros
2015-09-02          0
2015-09-03          0
2015-09-04          0
2015-09-05          0
2015-09-06          0
2015-09-07          0
2015-09-08          0
2015-09-09          0
2015-09-10          0
2015-09-11          0
2015-09-12          0
2015-09-13          0
2015-09-14          0
2015-09-15          0
2015-09-16          0
2015-09-17          0
2015-09-18          0
2015-09-19          0
2015-09-20          0
2015-09-21          0
2015-09-22          0
2015-09-23          0
2015-09-24          0
2015-09-25          0
2015-09-26          0
2015-09-27          0
2015-09-28          0
2015-09-29          0
2015-09-30          0
2015-10-01          0
2015-10-02          0
2015-10-03          1
2015-10-04          1
2015-10-05          3