Stergios Stergios - 16 days ago 5
Python Question

Add successive rows in Pandas if they match on some columns

I have a dataframe like the following one:

ID URL seconds
1 Email 9
1 Email 3
1 App 5
1 App 9
1 Faceboook 50
1 Faceboook 7
1 Faceboook 39
1 Faceboook 10
1 Email 39
1 Email 5
1 Email 57
1 Faceboook 7
1 Faceboook 32
1 Faceboook 3
2 App 11
2 App 10
2 Email 56
2 Faceboook 9
2 Faceboook 46
2 Faceboook 16
2 Email 21


I want to sum the 'seconds' column for successive views of the same URL by the same ID. That's the result I'm looking for:

ID URL seconds
1 Email 12
1 App 14
1 Faceboook 106
1 Email 101
1 Faceboook 42
2 App 21
2 Email 56
2 Faceboook 71
2 Email 21


df.groupBy(['ID', 'URL']).sum()
would not work in this case as it would sum all cases of the same URL for the same ID, not only the successive ones.

Any ideas?

Answer

You can use groupby by Series created by compare by ne column URL and shifted, last use cumsum with boolean mask:

print ((df.URL.ne(df.URL.shift())).cumsum())
0     1
1     1
2     2
3     2
4     3
5     3
6     3
7     3
8     4
9     4
10    4
11    5
12    5
13    5
14    6
15    6
16    7
17    8
18    8
19    8
20    9
Name: URL, dtype: int32
print (df['seconds'].groupby([(df.URL.ne(df.URL.shift())).cumsum(), df.ID, df.URL]).sum())

URL  ID  URL      
1    1   Email         12
2    1   App           14
3    1   Faceboook    106
4    1   Email        101
5    1   Faceboook     42
6    2   App           21
7    2   Email         56
8    2   Faceboook     71
9    2   Email         21
Name: seconds, dtype: int64

print (df['seconds'].groupby([(df.URL.ne(df.URL.shift())).cumsum(), df.ID, df.URL])
                    .sum()
                    .reset_index(level=0, drop=True)
                    .reset_index())

   ID        URL  seconds
0   1      Email       12
1   1        App       14
2   1  Faceboook      106
3   1      Email      101
4   1  Faceboook       42
5   2        App       21
6   2      Email       56
7   2  Faceboook       71
8   2      Email       21  
Comments