MarcelKlockman MarcelKlockman - 1 month ago 12
Python Question

python perform an operation by group

I have the following dataframe with 13 different brands (xx, yy, zz etc), all with 365 rows per brand (a full years worth of observations by brand). This is the result df.head().

brand exMarketplaces incMarketplaces availabilityRate
date
2015-08-01 xx 54372.8601 65826.384700 0.260209
2015-08-02 xx 74335.6785 89722.492300 0.255547
2015-08-03 xx 78563.0134 110132.200751 0.262365
2015-08-04 xx 84414.0774 104982.190444 0.261942
2015-08-05 xx 70499.8071 87368.524093 0.263748


I am trying to trim the outliers on a group level, based on the exMarketplaces column. The following code does so for the entire dataframe (regardless of brand).

df_clean = df[np.abs(df.exMarketplaces-df.exMarketplaces.mean())<=(3*df.exMarketplaces.std())]


However I would like to complete this operation on a brand level. So in effect, this operation should be run 13 times, for each of the 13 brands, and the result will be a single dataframe object with all outliers removed.

Can anybody suggest a way of doing this? I have the gut feeling it is quite a simple thing to do and that I am missing a trick here.

Thanks in advance.

Answer

Perform Groupby w.r.t the brand column and run the function on the column of interest using apply and finally take the filtered subset of the dataframe as shown:

func = lambda x: (np.abs(x-x.mean()) <= 3*x.std())
df_clean = df[df.groupby('brand')['exMarketplaces'].apply(func)]