Michael Perdue Michael Perdue - 6 months ago 174
Python Question

pandas: drop duplicates in groupby 'date'

I have the following dataframe:

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df'

df=pd.read_csv(url)

df.groupby('date').cid.size()

date
2005 7
2006 237
2007 3610
2008 1318
2009 2664
2010 997
2011 6390
2012 2904
2013 7875
2014 3979

df.groupby('date').cid.nunique()

date
2005 3
2006 10
2007 227
2008 52
2009 142
2010 57
2011 219
2012 99
2013 238
2014 146
Name: cid, dtype: int64


I would like to eliminate the duplicate
cid
values such that the output from
df.groupby('date').cid.size()
matches the output from
df.groupby('date').cid.nunique()
. I have looked at this post but it does not seem to have a solid solution to the problem.

I have tried the following:

df.groupby([df['date']]).drop_duplicates(cols='cid')


But I get this error:

AttributeError: Cannot access callable attribute 'drop_duplicates' of 'DataFrameGroupBy' objects, try using the 'apply' method


and this:

df.groupby(('date').drop_duplicates('cid'))


But I get this error:

AttributeError: 'str' object has no attribute 'drop_duplicates'


Does someone have an idea on this?

Answer

You don't need groupby to drop duplicates based on a few columns, you can specify a subset instead:

df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]: 
date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
dtype: int64
Comments