Jeff Saltfist Jeff Saltfist - 1 month ago 8
Python Question

Remove Quasi Duplicates In Pandas

I have a Pandas data frame that looks as follows:

import pandas as pd
data = pd.read_csv('C:\Users\Frank\Desktop\\10-25-16-54-7-IMPORT.csv', index_col=False)
print data.head(10)

Date Symbol
0 2015-03-18 01:54:35 UTC NKTR -0.290
1 2015-03-18 02:10:49 UTC DRQ -0.082
2 2015-03-18 03:03:10 UTC NKTR -0.290
3 2015-03-18 03:13:17 UTC UAM 0.414
4 2015-03-18 03:48:24 UTC ROCK 0.000
5 2015-03-18 03:56:30 UTC ROCK 0.000
6 2015-03-18 04:52:24 UTC MTZ -0.290
7 2015-03-18 05:00:29 UTC NKTR -0.290
8 2015-03-18 05:04:31 UTC NKTR -0.290
9 2015-03-18 05:29:48 UTC PSEC -0.046


I want to remove every row with a duplicate symbol (in this case "NKTR") that occurs subsequent to the first instance of that same symbol on that same day. Is this possible?

(removing duplicates will not work because of the different time stamp of the rows).

Answer

You can try groupby() the date of the Date column and Symbol then take the first row of each group:

import pandas as pd
df.groupby([pd.to_datetime(df.Date).dt.date, 'Symbol'], as_index=False).first()

#  Symbol                      Date  Value
#0    DRQ   2015-03-18 02:10:49 UTC -0.082
#1    MTZ   2015-03-18 04:52:24 UTC -0.290
#2   NKTR   2015-03-18 01:54:35 UTC -0.290
#3   PSEC   2015-03-18 05:29:48 UTC -0.046
#4   ROCK   2015-03-18 03:48:24 UTC  0.000
#5    UAM   2015-03-18 03:13:17 UTC  0.414
Comments