J.J.Marko J.J.Marko - 6 months ago 11
Python Question

Python Pandas Identify Duplicated rows with Additional Column

I have the following

Dataframe
:

df
Out[23]:
PplNum RoomNum Value
0 1 0 265
1 1 12 170
2 2 0 297
3 2 12 85
4 2 0 41
5 2 12 144


Generally the
PplNum
and
RoomNum
is generated like this, and it will always follow this format:

for ppl in [1,2,2]:
for room in [0, 12]:
print(ppl, room)

1 0
1 12
2 0
2 12
2 0
2 12


But now what I would like to achieve is to mark those duplicates combinations of
PplNum
and
RoomNum
so that I can know which combinationss are the first occurrence, which are the second occurrence and so on... So the expected output Dataframe will be like this:

PplNum RoomNum Value C
0 1 0 265 1
1 1 12 170 1
2 2 0 297 1
3 2 12 85 1
4 2 0 41 2
5 2 12 144 2

Answer

you can do it using groupby() together with cumcount() function:

In [102]: df['C'] = df.groupby(['PplNum','RoomNum']).cumcount() + 1

In [103]: df
Out[103]:
   PplNum  RoomNum  Value  C
0       1        0    265  1
1       1       12    170  1
2       2        0    297  1
3       2       12     85  1
4       2        0     41  2
5       2       12    144  2

Explanation:

In [101]: df.groupby(['PplNum','RoomNum']).cumcount() + 1
Out[101]:
0    1
1    1
2    1
3    1
4    2
5    2
dtype: int64

Group-By examples