user2242044 user2242044 - 6 days ago 6
Python Question

Covert Date Range to Numpy Array as part of Groupby in Pandas

I have a

Pandas
dataframe
queried from a database that has three columns. A start date, end date, and a person. The data may not make a lot of sense, but is just a simplified example.

startdate enddate person
0 2016-01-01 2016-01-02 A
1 2016-01-03 2016-01-03 A
2 2016-01-01 2016-01-01 B
3 2016-01-02 2016-01-02 B


Over a given date range, I'd like to know what days are covered by the entries in my
dataframe
for each person. My thought it to create an
numpy
array
with a length equal to number of days in the range. If that particular day falls in the range, the value of that index in the array is set to 1, otherwise 0. I can then flatten with a
groupby
and
lambda
function.

So given the above dataframe, and a date range of
2016-01-01
to
2016-01-03
, the final result would be:

date_binary
person
A [1, 1, 1]
B [1, 1, 0]


I've been able to work out some of the code (the
groupby
part), but am not sure how to go from a date range to an array. So in the full example below, I just hardcoded the transformed dataframe. I suppose, I could have asked a simplier question just around the part with the problem, but I know with
Pandas
there is often a very compact way to do things, so I am posting even the working part.

import pandas as pd
from datetime import datetime
import numpy as np
# initial dataset
df = pd.DataFrame(data=[['2016-01-01', '2016-01-02', 'A'],
['2016-01-03', '2016-01-03', 'A'],
['2016-01-01', '2016-01-01', 'B'],
['2016-01-02', '2016-01-02', 'B']],
columns=['startdate', 'enddate', 'person'])

# convert columns to dates
df['startdate']= pd.to_datetime(df['startdate'], format='%Y-%m-%d')
df['enddate']= pd.to_datetime(df['enddate'], format='%Y-%m-%d')

# define period for which the matrix should be created
start_date = datetime(month=01, day=1, year=2016)
end_date = datetime(month=1, day=10, year=2016)


######################
# Unsure how to do this
#####################

# what the dataframe should look like
df = pd.DataFrame(data=[[[1, 1, 0], 'A'],
[[0, 0, 1], 'A'],
[[1, 0, 0], 'B'],
[[0, 1, 0], 'B']],
columns=['date_binary', 'person'])

# flatten by person
df = df.groupby('person').aggregate(lambda x: tuple(x))

# take the max value
df.date_binary = df.date_binary.apply(lambda x: np.array([max(i) for i in zip(*x)]))

print df

Answer

I think you can apply custom function with reindex by your date_range which return new_index and indexer. Last need replace in indexer values -1 to 0 and another values to 1 by numpy.where:

# define period for which the matrix should be created
start_date = datetime(month=1, day=1, year=2016)
end_date = datetime(month=1, day=3, year=2016)

dr = pd.date_range(start_date, end_date) 

def f(x):
    arr = pd.date_range(x.startdate, x.enddate).reindex(dr)[1]
    return pd.Series([np.where(arr == -1, 0, 1)])

df['date_binary'] = df.apply(f, axis=1)
df = df[['date_binary', 'person']]
print (df)
  date_binary person
0   [1, 1, 0]      A
1   [0, 0, 1]      A
2   [1, 0, 0]      B
3   [0, 1, 0]      B