I have a dataframe:

id  store  val1  val2
 1  abc      20    30
 1  abc      20    40
 1  qwe      78    45
 2  dfd      34    45
 2  sad      43    45
For each id I want the total number of store entries (total_store), the number of distinct stores (unique stores), and the number of stores that occur exactly once within the group (non-repeating_stores):

id  total_store  unique stores  non-repeating_stores
 1            3              2                     1
 2            2              2                     2

I can get the first column with

df.groupby('id')['store'].count()

but I don't know how to compute the other two.
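For reference, the sample frame can be rebuilt with the following snippet (a minimal reproduction of the data above, so the code below is runnable as-is):

import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2],
    'store': ['abc', 'abc', 'qwe', 'dfd', 'sad'],
    'val1':  [20, 20, 78, 34, 43],
    'val2':  [30, 40, 45, 45, 45],
})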
You can use groupby + agg:
# 'count' -> total store entries, 'nunique' -> distinct stores,
# the lambda keeps only values that occur exactly once in the group
df = df.groupby('id').store.agg(
    ['count', 'nunique',
     lambda x: x.drop_duplicates(keep=False).size])
df.columns = ['total_store', 'unique stores', 'non-repeating_stores']
df

    total_store  unique stores  non-repeating_stores
id
1             3              2                     1
2             2              2                     2
For older pandas versions, passing a dict to agg lets you name the result columns in one step (this style was deprecated from 0.20 onwards):
agg_funcs = {'total_stores': 'count',
             'unique_stores': 'nunique',
             'non-repeating_stores': lambda x: x.drop_duplicates(keep=False).size}
df = df.groupby('id').store.agg(agg_funcs)
df
    total_stores  non-repeating_stores  unique_stores
id
1              3                     1              2
2              2                     2              2
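As a side note, on pandas 0.25 and later the dict style has been superseded by named aggregation, which gives the same one-step renaming without the deprecation warning. A minimal sketch (note that keyword names must be valid Python identifiers, hence the underscores):

df = df.groupby('id').store.agg(
    total_stores='count',
    unique_stores='nunique',
    non_repeating_stores=lambda x: x.drop_duplicates(keep=False).size,
)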
For a slight speed improvement, you can use drop_duplicates' sister method, duplicated, as documented by jezrael:
lambda x: (~x.duplicated(keep=False)).sum()
This would replace the third function passed to agg, with roughly a 20% speed boost on data of size 1,000,000 rows:
1 loop, best of 3: 7.31 s per loop    # drop_duplicates
v/s
1 loop, best of 3: 5.19 s per loop    # duplicated
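Putting it together, the faster variant of the first snippet looks like this (same output, only the third aggregator is swapped):

df = df.groupby('id').store.agg(
    ['count', 'nunique',
     lambda x: (~x.duplicated(keep=False)).sum()])
df.columns = ['total_store', 'unique stores', 'non-repeating_stores']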