
Finding and counting duplicates based on other columns

I have the following table in Pandas:

Date System Action
20.08.2017 A 9:01:01
20.08.2017 B 9:05:11
20.08.2017 B 9:05:11
20.08.2017 B 9:10:00
20.08.2017 B 9:05:11
21.08.2017 A 8:03:05
21.08.2017 A 8:03:05
22.09.2017 A 8:05:00
22.09.2017 B 9:05:11
22.09.2017 B 9:05:11
22.09.2017 B 9:05:11


I would like to find duplicate "Action" times, but only within the same system and date, and then mark them in a Seq column. The result should look like this:

Date System Action Seq
20.08.2017 A 9:01:01 1
20.08.2017 B 9:05:11 1
20.08.2017 B 9:05:11 2
20.08.2017 B 9:10:00 1
20.08.2017 B 9:05:11 3
21.08.2017 A 8:03:05 1
21.08.2017 A 8:03:05 2
22.09.2017 A 8:05:00 1
22.09.2017 B 9:05:11 1
22.09.2017 B 9:05:11 2
22.09.2017 B 9:05:11 3


The first occurrence of a time for a given system on a new day gets "1". Every following identical entry for the same day and the same system is incremented by 1. The counter is kept separately for each day and each system.

Can you please hint at a Pythonic way to do this? All I can think of is a loop over all the data, but the data set is quite large…
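
For reference, the naive loop I have in mind would look roughly like this (a minimal sketch, assuming the table is already in a DataFrame named df):

from collections import defaultdict

# Walk the rows and keep a running counter per (Date, System, Action) combination.
counts = defaultdict(int)
seq = []
for _, row in df.iterrows():
    key = (row['Date'], row['System'], row['Action'])
    counts[key] += 1
    seq.append(counts[key])
df['Seq'] = seq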

Answer

This looks like a job for df.groupby followed by GroupBy.cumcount:

In [1018]: df.assign(Seq=df.groupby(['Date', 'System', 'Action']).cumcount() + 1)
Out[1018]: 
          Date System   Action  Seq
0   20.08.2017      A  9:01:01    1
1   20.08.2017      B  9:05:11    1
2   20.08.2017      B  9:05:11    2
3   20.08.2017      B  9:10:00    1
4   20.08.2017      B  9:05:11    3
5   21.08.2017      A  8:03:05    1
6   21.08.2017      A  8:03:05    2
7   22.09.2017      A  8:05:00    1
8   22.09.2017      B  9:05:11    1
9   22.09.2017      B  9:05:11    2
10  22.09.2017      B  9:05:11    3
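
For a quick self-contained check, the sample table from the question can be rebuilt and the sequence written back in place (a sketch using the column names from the question; it is the same groupby/cumcount call as above, with + 1 because cumcount starts at 0):

import pandas as pd

# Rebuild the sample table from the question.
df = pd.DataFrame({
    'Date':   ['20.08.2017'] * 5 + ['21.08.2017'] * 2 + ['22.09.2017'] * 4,
    'System': ['A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Action': ['9:01:01', '9:05:11', '9:05:11', '9:10:00', '9:05:11',
               '8:03:05', '8:03:05', '8:05:00', '9:05:11', '9:05:11', '9:05:11'],
})

# cumcount numbers the rows within each (Date, System, Action) group
# starting at 0, so add 1 to start each sequence at 1.
df['Seq'] = df.groupby(['Date', 'System', 'Action']).cumcount() + 1

This avoids any explicit Python loop, so it stays fast even on a large data set.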