jason jason - 1 year ago 71
Python Question

How to sum pivot data with a DIFF in time by one of the indexes

Sorry the Title of the post isn't clear and that's probably why I'm struggling to google this in Python/Pandas. I'm wondering if I need to groupby and then diff data....

I'm trying to find the total time a person is using a web site by finding the differences between page access time in an activity log. I'm using SHIFT in python to get the previous (or next) record as a new column in a data frame. I'm then comparing the difference in time this way. This all works and I have asked about this before.

The problem I can't work out is
How to "DIFF" times in the activity log only where the times are for the same user or session. I'm currently DIFFing all records in a data frame but this is clearly wrong when the user or the session ID changes. I'm getting very high values in some places.

Here's my example in python code.

How do I only DIFF for the same "B" records and start again for a new "B" value.


# python diff summing

import urllib
import numpy as np
import pandas as pd
import time
import datetime
import math

# Using lamba as in reality I'm diffing 2 dates and returning
# difference in seconds. and checking for zero or NAN
def my_diff(a, b):
c = abs(b-a)
if c <1:
if math.isnan(c):
return c

# Required -sum the total time in seconds
# where a user (A) spends time in one session (B)
# I want the total time for all sessions per user but
# having trouble DIFFing between values *only*
# in the same session
# In real data the data is also sorted by
# user, session, datetime so no need to sort.

# A is userid
# B is session ID so will be an integer and unique
# C is access timestamp but just a random number here.
# D is irrelevant here

df = pd.DataFrame({'A' : ['user1', 'user1', 'user1', 'user1',
'user2', 'user2', 'user2', 'user2'],
'B' : ['abc123', 'random', 'jeff', 'gjgjg',
'four', 'five', 'six', 'seven'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
# This I think is wrong as I'm DIFFing next/previous record even
# though B the session ID might change
# I only want to DIFF where the session IDs are the same
# i.e. same user and same session.
df['PREVIOUS_C'] = df.C.shift(-1)
df['DIFF'] = df.apply(lambda row: my_diff(row.C, row.PREVIOUS_C), axis=1)
print df

# This is almost what I'm after but I want DIFF to only be difference
# where the session IDs (and user) are the same.
# Diffing over a change of A or B does not make sense here.
pivot = df.pivot_table(index=['A','B'], values=["DIFF"],aggfunc={np.sum},fill_value=0)

print pivot # Desired output to be saved to CSV

# pivot again to get total for each user

Answer Source

In the example you provided there are no common session IDs so your difference column will just have Null values but what you want is this:

df['diff'] = df.groupby(['A','B'])['C'].transform(lambda x:x - x.shift(-1))

If you want to see difference of C by just Username and not username + session ID you would do something similar:

df['diff'] = df.groupby(['A'])['C'].transform(lambda x:x - x.shift(-1))