user2242044 - 8 months ago 61

Python Question

I have a

`Pandas`

`dataframe`

`startdate enddate person`

0 2016-01-01 2016-01-02 A

1 2016-01-03 2016-01-03 A

2 2016-01-01 2016-01-01 B

3 2016-01-02 2016-01-02 B

Over a given date range, I'd like to know what days are covered by the entries in my

`dataframe`

`numpy`

`array`

`groupby`

`lambda`

So given the above dataframe, and a date range of

`2016-01-01`

`2016-01-03`

`date_binary`

person

A [1, 1, 1]

B [1, 1, 0]

I've been able to work out some of the code (the

`groupby`

`Pandas`

`import pandas as pd`

from datetime import datetime

import numpy as np

# initial dataset

df = pd.DataFrame(data=[['2016-01-01', '2016-01-02', 'A'],

['2016-01-03', '2016-01-03', 'A'],

['2016-01-01', '2016-01-01', 'B'],

['2016-01-02', '2016-01-02', 'B']],

columns=['startdate', 'enddate', 'person'])

# convert columns to dates

df['startdate']= pd.to_datetime(df['startdate'], format='%Y-%m-%d')

df['enddate']= pd.to_datetime(df['enddate'], format='%Y-%m-%d')

# define period for which the matrix should be created

start_date = datetime(month=01, day=1, year=2016)

end_date = datetime(month=1, day=10, year=2016)

######################

# Unsure how to do this

#####################

# what the dataframe should look like

df = pd.DataFrame(data=[[[1, 1, 0], 'A'],

[[0, 0, 1], 'A'],

[[1, 0, 0], 'B'],

[[0, 1, 0], 'B']],

columns=['date_binary', 'person'])

# flatten by person

df = df.groupby('person').aggregate(lambda x: tuple(x))

# take the max value

df.date_binary = df.date_binary.apply(lambda x: np.array([max(i) for i in zip(*x)]))

print df

Answer Source

I think you can `apply`

custom function with `reindex`

by your `date_range`

which return `new_index`

and `indexer`

. Last need replace in `indexer`

values `-1`

to `0`

and another values to `1`

by `numpy.where`

:

```
# define period for which the matrix should be created
start_date = datetime(month=1, day=1, year=2016)
end_date = datetime(month=1, day=3, year=2016)
dr = pd.date_range(start_date, end_date)
def f(x):
arr = pd.date_range(x.startdate, x.enddate).reindex(dr)[1]
return pd.Series([np.where(arr == -1, 0, 1)])
df['date_binary'] = df.apply(f, axis=1)
df = df[['date_binary', 'person']]
print (df)
date_binary person
0 [1, 1, 0] A
1 [0, 0, 1] A
2 [1, 0, 0] B
3 [0, 1, 0] B
```