dashiell dashiell - 3 months ago 5x
Python Question

Sum datetime differences based on column value

I have a dataframe that looks like:

    field1  field2  field3
t1       1       1       1
t2       1       1       0
t3       2       3       1
t4       3       3       0
t5       1       2       0

Times are in the form yyyy-mm-dd hh:mm:ss, and are currently indexing the dataframe.

field1 and field2 are used to identify items, such that the tuple (field1, field2) corresponds to a specific sensor somewhere in the world. field3 is the value of that sensor at the given time, and takes either the value 0 or 1.

I'd like to group the dataframe by (field1, field2) and sum the total time that each sensor spends at each value of field3. So, if t1='2016-07-20 00:00:00', t2='2016-07-20 00:01:00', and the current time is '2016-07-20 00:03:00', I would have a new dataframe that looks like:

       field3=0  field3=1
(1,1)     2 min     1 min
(2,3)       ...       ...
(3,3)       ...       ...
(1,2)       ...       ...

I assume that from t1 to t2, (1,1)'s value is 1, and from t2 onwards it is 0 because (1,1) doesn't appear again in the dataframe. The 1 min is from t2 - t1 and the 2 min is from current_time - t2.

The 2 min and 1 min values can be in any format (be it total minutes/seconds, a timedelta, or whatever).
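
To make the arithmetic for sensor (1,1) concrete, here is a small check using pandas timestamps. The variable names time_at_one and time_at_zero are just illustrative; the three timestamps are the ones given above.

```python
import pandas as pd

t1 = pd.Timestamp('2016-07-20 00:00:00')
t2 = pd.Timestamp('2016-07-20 00:01:00')
current_time = pd.Timestamp('2016-07-20 00:03:00')

# field3 is 1 from t1 until the next reading at t2
time_at_one = t2 - t1
# field3 is 0 from t2 until now, since (1,1) has no later reading
time_at_zero = current_time - t2

print(time_at_one, time_at_zero)
```

Both results are pd.Timedelta objects, so they can be summed directly or converted with .total_seconds().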

I've tried the following:

import pandas as pd
from collections import defaultdict, namedtuple

# so I can create a defaultdict(Field3) and save some logic
class Field3(object):
    def __init__(self):
        self.zero = pd.Timedelta('0 days')
        self.one = pd.Timedelta('0 days')

# used to map to field3 in a dictionary
Sensor = namedtuple('Sensor', 'field1 field2')

# the dataframe mentioned above
df = pd.DataFrame(...)

# iterate through each row of the dataframe and map from (field1,field2) to
# field3, adding time based on the value of field3 in the frame and the
# time difference between this row and the next
rows = list(df.iterrows())
sensor_to_field3 = defaultdict(Field3)
for i in range(len(rows) - 1):
    # rows[i] is an (index, row) pair, so rows[i][0] is the timestamp
    sensor = Sensor(field1=rows[i][1].iloc[0], field2=rows[i][1].iloc[1])
    if rows[i][1].iloc[2]:
        sensor_to_field3[sensor].one += rows[i+1][0] - rows[i][0]
    else:
        sensor_to_field3[sensor].zero += rows[i+1][0] - rows[i][0]
sensor_to_status = {k: [v] for k, v in sensor_to_field3.items()}
result = pd.DataFrame(sensor_to_status, index=[0])

It gets me basically what I want, though it currently only works when there's a single sensor represented in the entire table, which I don't really want to have to deal with if there's a better way of solving this.

I feel like there has to be a better way of going about this. Something like do a groupby on (field1, field2), then aggregate timedeltas based on field3 and the index, but I'm not sure how to go about doing that.


Managed to get it, in case anyone else runs into something remotely similar. Still not sure if it's optimal, but it feels better than what I was doing.

I changed the original dataframe to include the time as a column, and just use integer indices.
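
For reference, one way to make that change is reset_index. This is only a minimal sketch: it assumes the datetime index should end up in a column called time (the name the functions below read from), and the toy rows are made up apart from t1 and t2.

```python
import pandas as pd

# toy frame shaped like the one in the question; the third timestamp is made up
df = pd.DataFrame(
    {'field1': [1, 1, 2], 'field2': [1, 1, 3], 'field3': [1, 0, 1]},
    index=pd.to_datetime(['2016-07-20 00:00:00',
                          '2016-07-20 00:01:00',
                          '2016-07-20 00:02:00']),
)

# name the index so the new column is called 'time', then move it into a column;
# the frame is left with a plain 0..n-1 integer index
df.index.name = 'time'
df = df.reset_index()
```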

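from datetime import datetime

import pandas as pd

def create_time_deltas(dataframe):
    # add a timedelta column
    dataframe['timedelta'] = pd.Timedelta(minutes=0)
    # iterate over each row and set the timedelta to the difference of the next one and this one
    # (the loop body is missing from the post; this is the assumed form, with rows sorted by time)
    idx = dataframe.index
    for pos, i in enumerate(idx[:-1]):
        dataframe.loc[i, 'timedelta'] = dataframe.loc[idx[pos + 1], 'time'] - dataframe.loc[i, 'time']
    # set the last time value, which couldn't be set earlier because index out of bounds
    # (also assumed: the last reading is taken to hold until the current time)
    dataframe.loc[idx[-1], 'timedelta'] = pd.to_datetime(datetime.now()) - dataframe.loc[idx[-1], 'time']
    return dataframe

def group_by_field3_time(dataframe, start=None, stop=None):
    # optionally set time bounds on what to care about
    stop = stop if stop is not None else pd.to_datetime(datetime.now())
    in_bounds = dataframe['time'] < stop
    if start is not None:
        in_bounds &= dataframe['time'] > start
    recent = dataframe.loc[in_bounds]
    # group by sensor and add the timedelta column to each group
    by_td = pd.concat(create_time_deltas(group.copy())
                      for _, group in recent.groupby(['field1', 'field2']))
    # sum the timedeltas for each triple, which can be used later
    by_oc = by_td.groupby(['field1', 'field2', 'field3'])[['timedelta']].sum()
    return by_oc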

If anyone can think of a better way to do this I'm all ears, but this does feel a lot better than creating dictionaries all over the place.
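
One possible loop-free variant, offered as a sketch rather than a definitive answer: shift the time column within each sensor group to get the time of the next reading, subtract, and sum per (field1, field2, field3). The function name sum_time_per_value and the now parameter are my own, and every timestamp after t2 in the sample data is made up.

```python
import pandas as pd

def sum_time_per_value(df, now):
    """Total time each (field1, field2) sensor spent at each field3 value."""
    df = df.sort_values('time')
    # time of the next reading from the same sensor; the last reading runs until `now`
    next_time = df.groupby(['field1', 'field2'])['time'].shift(-1).fillna(now)
    durations = next_time - df['time']
    keys = [df['field1'], df['field2'], df['field3']]
    # one row per sensor, one column per field3 value; missing combinations become 0
    return durations.groupby(keys).sum().unstack('field3').fillna(pd.Timedelta(0))

sample = pd.DataFrame({
    'time': pd.to_datetime(['2016-07-20 00:00:00', '2016-07-20 00:01:00',
                            '2016-07-20 00:01:30', '2016-07-20 00:02:00',
                            '2016-07-20 00:02:30']),
    'field1': [1, 1, 2, 3, 1],
    'field2': [1, 1, 3, 3, 2],
    'field3': [1, 0, 1, 0, 0],
})
result = sum_time_per_value(sample, now=pd.Timestamp('2016-07-20 00:03:00'))
print(result)
```

Passing now explicitly instead of calling datetime.now() inside the function keeps the result reproducible. For sensor (1,1) this gives 2 minutes at 0 and 1 minute at 1, matching the example in the question.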