TinaW TinaW - 3 months ago 16
Python Question

Python - Count consecutive frequencies by group

I have a sequence of e-mails ordered by timestamp and user_id.

I want to investigate how often email i was followed by email j. I'm going to display these frequencies across users in a heat map to show the most common path.

a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject2
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 08:14:02,a@gmail.com,subject1
2016-07-01 16:26:35,a@gmail.com,subject4
2016-07-01 10:17:00,b@gmail.com,subject1
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 14:45:04,b@gmail.com,subject3
2016-07-01 08:14:02,b@gmail.com,subject2
2016-07-01 16:26:35,b@gmail.com,subject4
2016-07-01 18:00:00,c@gmail.com,welcome
2016-07-01 19:00:02,c@gmail.com,subject1
2016-07-01 20:00:04,c@gmail.com,subject3
2016-07-01 21:14:02,c@gmail.com,subject4
2016-07-01 21:26:35,c@gmail.com,subject2
"""

df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df1=df1.sort_values(['email','timestamp'])


sorted df1:

timestamp email subject
1 2016-07-01 02:01:02 a@gmail.com welcome
3 2016-07-01 08:14:02 a@gmail.com subject1
0 2016-07-01 10:17:00 a@gmail.com subject2
2 2016-07-01 14:45:04 a@gmail.com subject3
4 2016-07-01 16:26:35 a@gmail.com subject4
6 2016-07-01 02:01:02 b@gmail.com welcome
8 2016-07-01 08:14:02 b@gmail.com subject2
5 2016-07-01 10:17:00 b@gmail.com subject1
7 2016-07-01 14:45:04 b@gmail.com subject3
9 2016-07-01 16:26:35 b@gmail.com subject4
10 2016-07-01 18:00:00 c@gmail.com welcome
11 2016-07-01 19:00:02 c@gmail.com subject1
12 2016-07-01 20:00:04 c@gmail.com subject3
13 2016-07-01 21:14:02 c@gmail.com subject4
14 2016-07-01 21:26:35 c@gmail.com subject2


The output should look like this

welcome subject1 subject2 subject3 subject4
welcome 0
subject1 2 0
subject2 1 1 0
subject3 0 2 1 0
subject4 0 0 0 3 0


In other words, there were 2 occurrences where subject1 followed after a welcome email. There was 1 occurrence where subject 2 followed after a welcome message, etc.

What is the best way of doing this?

Answer

You can do a crosstab:

import pandas as pd
pd.crosstab(df1['subject'].iloc[1:].values, df1['subject'].iloc[:-1].values)
Out: 
col_0     subject1  subject2  subject3  subject4  welcome
row_0                                                    
subject1         0         1         0         0        2
subject2         1         0         0         1        1
subject3         2         1         0         0        0
subject4         0         0         3         0        0
welcome          0         0         0         2        0