Miguel Miguel - 6 months ago 16
Python Question

Find duplicates in column combo

I need to keep only the records that are unique across 2 columns.
In the dataframe (df) below, I want to delete the rows that repeat the same values in columns x and y.


x  y  z
1  3  1
4  4  3
2  4  3
1  3  2
3  5  2


What I did was concatenate the two columns into a new one, XY = str(x) + str(y), and keep the unique values with pd.unique(df.XY).
The records (1 3 1) and (1 3 2) would be duplicates.

I believe there has to be a better way of doing this, particularly when it comes to 3 or more columns.
Thanks,
MB

Answer

Use drop_duplicates:

print(df.drop_duplicates(subset=['x','y']))
   x  y  z
0  1  3  1
1  4  4  3
2  2  4  3
4  3  5  2
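
For anyone who wants to run this from scratch, here is a minimal sketch that rebuilds the frame by hand. The values are copied from the question; the variable name `unique_xy` is only an illustrative choice, not something from the original post.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'x': [1, 4, 2, 1, 3],
    'y': [3, 4, 4, 3, 5],
    'z': [1, 3, 3, 2, 2],
})

# Rows count as duplicates when they match on every column listed in `subset`;
# by default the first occurrence of each (x, y) pair is kept.
unique_xy = df.drop_duplicates(subset=['x', 'y'])
print(unique_xy)
```

The same call covers 3 or more columns: just list the extra column names in subset. No string concatenation is needed.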

You can choose whether the first or the last row of each duplicated group is kept with the keep parameter:

print(df.drop_duplicates(subset=['x','y']))
# this is the same as:
print(df.drop_duplicates(subset=['x','y'], keep='first'))
   x  y  z
0  1  3  1
1  4  4  3
2  2  4  3
4  3  5  2

print(df.drop_duplicates(subset=['x','y'], keep='last'))
   x  y  z
1  4  4  3
2  2  4  3
3  1  3  2
4  3  5  2
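
The choice of keep only matters for the columns outside subset, here z. A small sketch comparing the two results, assuming the same frame as in the question (the names `first`, `last`, `z_first` and `z_last` are made up for this illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 4, 2, 1, 3],
    'y': [3, 4, 4, 3, 5],
    'z': [1, 3, 3, 2, 2],
})

first = df.drop_duplicates(subset=['x', 'y'], keep='first')
last = df.drop_duplicates(subset=['x', 'y'], keep='last')

# Look at the surviving row for the duplicated pair (x=1, y=3) in each result.
z_first = first.loc[(first['x'] == 1) & (first['y'] == 3), 'z'].tolist()
z_last = last.loc[(last['x'] == 1) & (last['y'] == 3), 'z'].tolist()

print(z_first)  # z value from the first occurrence (index 0)
print(z_last)   # z value from the last occurrence (index 3)
```

Both results have the same number of rows; only the surviving row for the pair (1, 3) differs.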

If you need to remove every row that has a duplicate, keeping none of them, use keep=False:

print(df.drop_duplicates(subset=['x','y'], keep=False))
   x  y  z
1  4  4  3
2  2  4  3
4  3  5  2
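
If you would rather inspect the duplicated rows before dropping them, DataFrame.duplicated takes the same subset and keep arguments and returns a boolean mask. A short sketch, again assuming the frame from the question (`mask`, `dupes` and `no_dupes` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 4, 2, 1, 3],
    'y': [3, 4, 4, 3, 5],
    'z': [1, 3, 3, 2, 2],
})

# keep=False marks every member of a duplicated (x, y) group as True.
mask = df.duplicated(subset=['x', 'y'], keep=False)

dupes = df[mask]      # the rows that share an (x, y) pair with another row
no_dupes = df[~mask]  # same rows as drop_duplicates(..., keep=False)

print(dupes)
print(no_dupes)
```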