user2242044 - 1 month ago
Python Question

Calculating percentage of days covered in a range with a groupby Pandas function

I have a simple Pandas dataframe where each row represents a person and a date range. For each person, I'd like to know what percentage of days in a hard-coded range (defined by variables period_start and period_end) are covered across the various entries in the dataframe.

I assume there is an easy way to do this with Pandas, but I haven't been able to find one. I have a solution with multiple dataframes and several nested loops, but this is inefficient at scale. How can I do this more effectively utilizing Pandas? I think a groupby makes sense, but not sure how to do that when the ranges are across two columns and may overlap.

import pandas as pd
from datetime import datetime
df = pd.DataFrame(data=[['2016-01-01', '2016-01-31', 'A'],
                        ['2016-02-02', '2016-02-10', 'A'],
                        ['2016-03-01', '2016-04-01', 'A'],
                        ['2016-01-01', '2016-03-01', 'B']],
                  columns=['startdate', 'enddate', 'person'])

# start and end date
period_start = datetime(year=2016, month=1, day=1)
period_end = datetime(year=2016, month=12, day=31)

# calculate total days
total_days = (period_end-period_start).days + 1

# convert columns to dates
df['startdate']= pd.to_datetime(df['startdate'], format='%Y-%m-%d')
df['enddate']= pd.to_datetime(df['enddate'], format='%Y-%m-%d')

# create a date-indexed dataframe with one boolean column per person
rng = pd.date_range(period_start, periods=total_days, freq='D')
people = sorted(set(df['person'].tolist()))
dates_df = pd.DataFrame(False, index=rng, columns=people)

# loop over each date (index)
for index, row in dates_df.iterrows():

    # loop over each column (person)
    for person in people:
        tmp = df[df['person'] == person]

        # loop over each entry for the person
        for index1, row1 in tmp.iterrows():

            # check if the date index in dates_df is within the entry's range
            value = row1['startdate'] <= index <= row1['enddate']

            # if it's not already set to True, set it to True
            if value and not dates_df.loc[index, person]:
                dates_df.loc[index, person] = True

# for each person, show the percentage of days in range that are covered
for person in people:
    print(person, dates_df[person].sum() / float(total_days))


Desired Output:

A 0.196721311475
B 0.166666666667

Answer

This should do it. Since you're adding 1 to the total days, I'm guessing you want the ranges to be inclusive, but edit it as needed :)

import pandas as pd
from datetime import datetime

df = pd.DataFrame(data=[['2016-01-01', '2016-01-31', 'A'],
                        ['2016-02-02', '2016-02-10', 'A'],
                        ['2016-03-01', '2016-04-01', 'A'],
                        ['2016-01-01', '2016-03-01', 'B']],
                  columns=['startdate', 'enddate', 'person'])

# start and end date
period_start = datetime(year=2016, month=1, day=1)
period_end = datetime(year=2016, month=12, day=31)

# convert columns to dates
df['startdate']= pd.to_datetime(df['startdate'],  format='%Y-%m-%d')
df['enddate']= pd.to_datetime(df['enddate'],  format='%Y-%m-%d')
df['days'] = df.apply(lambda x: max((min(x.enddate, period_end) - max(x.startdate, period_start)).days + 1, 0), axis=1)

#percentage of days in range by person
people_pct = df.groupby('person').apply(lambda x: x.days.sum() / ((period_end - period_start).days + 1))
print(people_pct.head())

-----------------
    person
    A    0.196721
    B    0.166667

You're on the right track. pandas groupby is great for segmenting data, and much of its flexibility comes from the .apply() function, which runs a function on each group: either the equivalent of a common aggregation (mean, std, etc.) or, as in this case, a custom function.

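There are two lambdas here. The one passed to df.apply(..., axis=1) runs once per row and returns the number of days of that row's range that fall inside the period (never less than 0). The one passed to groupby('person').apply() runs once per person's group and returns a single number, and pandas assembles those numbers into a Series indexed by person.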
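For comparison, here is a rough sketch of the same calculation without any lambdas: the ranges are clipped to the period with Series.clip, and groupby uses the built-in sum. It assumes the same sample data as above and, like the code above, that a person's ranges don't overlap. The names clipped_start, clipped_end and pct are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[['2016-01-01', '2016-01-31', 'A'],
                        ['2016-02-02', '2016-02-10', 'A'],
                        ['2016-03-01', '2016-04-01', 'A'],
                        ['2016-01-01', '2016-03-01', 'B']],
                  columns=['startdate', 'enddate', 'person'])
df['startdate'] = pd.to_datetime(df['startdate'], format='%Y-%m-%d')
df['enddate'] = pd.to_datetime(df['enddate'], format='%Y-%m-%d')

# period boundaries (inclusive on both ends)
period_start = pd.Timestamp('2016-01-01')
period_end = pd.Timestamp('2016-12-31')
total_days = (period_end - period_start).days + 1

# clip every range to the period, column-wise instead of row by row
clipped_start = df['startdate'].clip(lower=period_start)
clipped_end = df['enddate'].clip(upper=period_end)

# inclusive day count; ranges entirely outside the period become 0
df['days'] = ((clipped_end - clipped_start).dt.days + 1).clip(lower=0)

# built-in aggregation instead of a custom function
pct = df.groupby('person')['days'].sum() / total_days
print(pct)
```

On this sample data it should give the same percentages as the apply-based version; whether it is noticeably faster depends on the size of the dataframe.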

While this covers your question, it doesn't detect unique days: if a person's ranges overlapped, the shared days would be counted more than once. So we're assuming each person's rows don't overlap, as in your example.
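If overlaps can occur, one possible workaround (a sketch, not the only way to do it) is to expand each range into individual days and count the unique days per person. The second row for A below is made-up data that overlaps the first, just to show the effect; records, days and covered are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(data=[['2016-01-01', '2016-01-31', 'A'],
                        ['2016-01-20', '2016-02-10', 'A'],  # made-up row overlapping the one above
                        ['2016-01-01', '2016-03-01', 'B']],
                  columns=['startdate', 'enddate', 'person'])
df['startdate'] = pd.to_datetime(df['startdate'], format='%Y-%m-%d')
df['enddate'] = pd.to_datetime(df['enddate'], format='%Y-%m-%d')

period_start = pd.Timestamp('2016-01-01')
period_end = pd.Timestamp('2016-12-31')
total_days = (period_end - period_start).days + 1

# one (person, day) record per covered day, with each range clipped to the period;
# a range entirely outside the period yields an empty date_range and adds nothing
records = [
    (person, day)
    for person, start, end in zip(df['person'], df['startdate'], df['enddate'])
    for day in pd.date_range(max(start, period_start), min(end, period_end), freq='D')
]
days = pd.DataFrame(records, columns=['person', 'day'])

# nunique counts each day once per person, so overlapping ranges are not double-counted
covered = days.groupby('person')['day'].nunique() / total_days
print(covered)
```

This builds one row per covered day, so it trades memory for simplicity. For very long ranges or many people, merging overlapping intervals before counting would likely scale better.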
