Mukul Mukul - 4 months ago 8
Python Question

Efficient group by and where clause in python

I have weekly (say 5 weeks) sales and baskets for each product and store combination. I want to find the total spend and visits per product (irrespective of store) for a particular week, say 201520, i.e. the 20th week of 2015. Once I select a week, there can be products that were not sold in that week, but I do not want them dropped by the group by. Essentially I want all products sold across the 5 weeks; if a product was not sold in the selected week, it should still appear in my final DataFrame with its aggregates set to 0. Sample data (assume prod 122 is not sold in week 201520):

prod store week baskets sales
123 112 201518 20 100.45
123 112 201519 21 89.65
123 112 201520 22 1890.54
122 112 201518 10 909.99


Sample output (for 201520):

prod total_baskets total_sales spend_per_basket
123 22 1890.54 85.93363636
122 0 0 0


I know this can be done with a groupby in pandas, but I am currently doing it in multiple steps and am looking for a more Pythonic, efficient way:

I first filter to the week I want to group by.

Then I create a list of all products present in my initial weekly dataset.

Then I merge that back with the grouped data. I find this inefficient; please help. I also need to create spend_per_basket: if total_baskets > 0, spend_per_basket is total_sales / total_baskets, else 0. TIA.
Dummy code:

import pandas as pd

trans_for_my_week = weekly_trans[weekly_trans['week'] == 201520]
avg_sales = pd.DataFrame(trans_for_my_week.groupby(['prod']).agg(
    {'baskets': {'total_baskets': 'sum'},
     'sales': {'total_sales': 'sum'}}))
avg_sales.columns = avg_sales.columns.droplevel(0)
avg_sales = avg_sales.reset_index()


and so on
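For reference, the three steps described above can be sketched end-to-end in one short script (column names taken from the sample data; the week 201520 and the "0 when no baskets" rule come from the question):

```python
import numpy as np
import pandas as pd

# sample data from the question
weekly_trans = pd.DataFrame({
    'prod':    [123, 123, 123, 122],
    'store':   [112, 112, 112, 112],
    'week':    [201518, 201519, 201520, 201518],
    'baskets': [20, 21, 22, 10],
    'sales':   [100.45, 89.65, 1890.54, 909.99],
})

# step 1: filter to the selected week and aggregate per product
grouped = (weekly_trans[weekly_trans['week'] == 201520]
           .groupby('prod')[['baskets', 'sales']].sum()
           .rename(columns={'baskets': 'total_baskets',
                            'sales': 'total_sales'})
           .reset_index())

# step 2: list of all products in the full weekly dataset
all_prods = pd.DataFrame({'prod': weekly_trans['prod'].unique()})

# step 3: remerge so unsold products survive with aggregates of 0
out = pd.merge(all_prods, grouped, on='prod', how='left').fillna(0)

# spend_per_basket: total_sales / total_baskets, else 0
out['spend_per_basket'] = np.where(out['total_baskets'] > 0,
                                   out['total_sales'] / out['total_baskets'],
                                   0)
```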

Answer

UPDATE2: adding a new calculated multi-level column:

In [229]: x = res.sales / res.baskets

In [230]: x
Out[230]:
week   201518    201519     201520
prod
122   90.9990       NaN        NaN
123    5.0225  4.269048  85.933636

In [231]: x.columns = pd.MultiIndex.from_product([['spend_per_basket'], res.columns.get_level_values(1).drop_duplicates()])

In [232]: x
Out[232]:
     spend_per_basket
               201518    201519     201520
prod
122           90.9990       NaN        NaN
123            5.0225  4.269048  85.933636

In [234]: res = res.join(x)

In [235]: res
Out[235]:
     baskets                 sales                 spend_per_basket
week  201518 201519 201520  201518 201519   201520           201518    201519     201520
prod
122       10      0      0  909.99   0.00     0.00          90.9990       NaN        NaN
123       20     21     22  100.45  89.65  1890.54           5.0225  4.269048  85.933636

PS: the pandas documentation covers MultiIndex (multi-level) indexing techniques very well, with lots of examples.
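Pulling the UPDATE2 session together, a runnable sketch (assuming the sample data from the question; note that 0/0 yields NaN for weeks where a product had no baskets, as in the output above):

```python
import pandas as pd

df = pd.DataFrame({
    'prod':    [123, 123, 123, 122],
    'store':   [112, 112, 112, 112],
    'week':    [201518, 201519, 201520, 201518],
    'baskets': [20, 21, 22, 10],
    'sales':   [100.45, 89.65, 1890.54, 909.99],
})

res = df.pivot_table(index='prod', columns='week',
                     values=['baskets', 'sales'],
                     aggfunc='sum', fill_value=0)

# per-week spend per basket; 0/0 gives NaN for unsold weeks
x = res['sales'] / res['baskets']

# label the new block of columns with a top level of its own
x.columns = pd.MultiIndex.from_product([['spend_per_basket'], x.columns])

# attach the calculated block alongside baskets and sales
res = res.join(x)
```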

UPDATE: inspired by @JoeR's solution, here is a modified pivot_table() version:

res = df.pivot_table(index='prod', columns='week', values=['baskets','sales'], aggfunc='sum', fill_value=0)

In [189]: res
Out[189]:
     baskets                 sales
week  201518 201519 201520  201518 201519   201520
prod
122       10      0      0  909.99   0.00     0.00
123       20     21     22  100.45  89.65  1890.54

In [190]: res[[('baskets',201519)]]
Out[190]:
     baskets
week  201519
prod
122        0
123       21

In [192]: res.loc[122, [('sales',201519)]]
Out[192]:
       week
sales  201519    0.0
Name: 122, dtype: float64

you can also flatten your column levels as follows:

In [194]: res2 = res.copy()

In [196]: res2.columns = ['{0[0]}_{0[1]}'.format(col) for col in res2.columns]

In [197]: res2
Out[197]:
      baskets_201518  baskets_201519  baskets_201520  sales_201518  sales_201519  sales_201520
prod
122               10               0               0        909.99          0.00          0.00
123               20              21              22        100.45         89.65       1890.54

But I would keep the multi-level columns so you can use advanced indexing (as in the examples above).
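A short sketch of the kind of advanced indexing the multi-level columns allow, using pd.IndexSlice (assuming res built with pivot_table as above):

```python
import pandas as pd

df = pd.DataFrame({
    'prod':    [123, 123, 123, 122],
    'store':   [112, 112, 112, 112],
    'week':    [201518, 201519, 201520, 201518],
    'baskets': [20, 21, 22, 10],
    'sales':   [100.45, 89.65, 1890.54, 909.99],
})

res = df.pivot_table(index='prod', columns='week',
                     values=['baskets', 'sales'],
                     aggfunc='sum', fill_value=0)

idx = pd.IndexSlice

# one measure, all weeks (drops the first column level)
baskets = res['baskets']

# all measures, one week
week_20 = res.loc[:, idx[:, 201520]]

# one cell: a (measure, week) tuple as the column key
cell = res.loc[122, ('sales', 201519)]
```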

OLD answer:

I'd calculate it once for all your data:

from itertools import product

g = df.groupby(['week', 'prod']).agg({'baskets':'sum', 'sales':'sum'}).reset_index()
al = pd.DataFrame(list(product(df['prod'].unique(), df.week.unique())), columns=['prod','week'])

res = pd.merge(al, g, on=['prod','week'], how='left').fillna(0)

In [166]: res
Out[166]:
   prod    week    sales  baskets
0   123  201518   100.45     20.0
1   123  201519    89.65     21.0
2   123  201520  1890.54     22.0
3   122  201518   909.99     10.0
4   122  201519     0.00      0.0
5   122  201520     0.00      0.0