sshr sshr - 1 month ago 8
Python Question

Finding duplicate rows in Python

I have timestamp and id variables in my dataframe (df):

timestamp id
2016-06-09 8:33:37 a1
2016-06-09 8:33:37 a1
2016-06-09 8:33:38 a1
2016-06-09 8:33:39 a1
2016-06-09 8:33:39 a1
2016-06-09 8:33:37 b1
2016-06-09 8:33:38 b1


Each id can't have the same timestamp twice. I have to print these duplicate timestamps for each id. In the case above, the output should be rows 1, 2, 4 and 5.

The following code gives the duplicate timestamps:

set([x for x in df['timestamp'] if df['timestamp'].tolist().count(x) > 1])


How can I take id into account along with timestamp to get the duplicate rows?

Answer

You can use groupby and get a mask of all duplicated values per group with Series.duplicated. Then use boolean indexing:

# group_keys=False keeps the original row index, so the mask aligns with df
print(df.groupby(['id'], group_keys=False)['timestamp'].apply(lambda x: x.duplicated(keep=False)))
0     True
1     True
2    False
3     True
4     True
5    False
6    False
Name: timestamp, dtype: bool

# group_keys=False keeps the original row index, so the mask aligns with df
print(df[df.groupby(['id'], group_keys=False)['timestamp'].apply(lambda x: x.duplicated(keep=False))])
            timestamp  id
0 2016-06-09 08:33:37  a1
1 2016-06-09 08:33:37  a1
3 2016-06-09 08:33:39  a1
4 2016-06-09 08:33:39  a1
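
As an alternative sketch (not part of the original answer), the same rows can probably be selected without groupby by calling DataFrame.duplicated on both columns at once. The sample frame below is rebuilt by hand from the question's data, and the names mask and dupes are only illustrative:

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption about
# how the real df is built; timestamps are zero-padded here).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-06-09 08:33:37",
        "2016-06-09 08:33:37",
        "2016-06-09 08:33:38",
        "2016-06-09 08:33:39",
        "2016-06-09 08:33:39",
        "2016-06-09 08:33:37",
        "2016-06-09 08:33:38",
    ]),
    "id": ["a1", "a1", "a1", "a1", "a1", "b1", "b1"],
})

# keep=False flags every row whose (id, timestamp) pair occurs more
# than once, including the first occurrence of each pair.
mask = df.duplicated(subset=["id", "timestamp"], keep=False)

# Boolean indexing keeps only the flagged rows.
dupes = df[mask]
print(dupes)
```

Because the comparison is done on the (id, timestamp) pair, the b1 row at 08:33:37 is not flagged even though a1 has the same timestamp.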