Mukul Mukul - 1 month ago 9
Python Question

python Iterative loop through columns of dataframe

Working on a problem, I have the following dataframe in python

week hour week_hr store_code baskets
0 201616 106 201616106 505 0
1 201616 107 201616107 505 0
2 201616 108 201616108 505 0
3 201616 109 201616109 505 18
4 201616 110 201616110 505 0
5 201616 106 201616108 910 0
6 201616 107 201616106 910 0
7 201616 108 201616107 910 2
8 201616 109 201616108 910 3
9 201616 110 201616109 910 10


Here "hour" variable is a concat of "weekday" and "hour of shop", example weekday is monday=1 and hour of shop is 6am then hour variable = 106, similarly cal_hr is a concat of week and hour. I want to get those rows where i see a trend of no baskets , i.e 0 baskets for rolling 3 weeks. in the above case i will only get the first 3 rows. i.e. for store 505 there is a continuous cycle of 1 baskets from 106 to 108. But i do not want the rows (4,5,6) because even though there are 0 baskets for 3 continuous hours but the hours are actually NOT continuous. 110 -> 106 -> 107 . For the hours to be continuous they should lie in the range of 106 - 110.. Essentially i want all stores and the respective rows if it has 0 baskets for continuous 3 hours on any given day. Dummy output

week hour week_hr store_code baskets
0 201616 106 201616106 505 0
1 201616 107 201616107 505 0
2 201616 108 201616108 505 0


Can i do this in python using pandas and loops? The dataset requires sorting by store and hour. Completely new to python (

Answer

Do the following:

  1. Sort by store_code, week_hr
  2. Filter by 0
  3. Store the subtraction between df['week_hr'][1:].values-df['week_hr'][:-1].values so you will get to know if they are continuos.
  4. Now you can give groups to continuous and filter as you want.

    import numpy as np
    import pandas as pd
    
    # 1
    t1 = df.sort_values(['store_code', 'week_hr'])
    
    # 2
    t2 = t1[t1['baskets'] == 0]
    
    # 3
    continuous = t2['week_hr'][1:].values-t2['week_hr'][:-1].values == 1
    groups = np.cumsum(np.hstack([False, continuous==False]))
    t2['groups'] = groups
    
    # 4
    t3 = t2.groupby(['store_code', 'groups'], as_index=False)['week_hr'].count()
    t4 = t3[t3.week_hr > 2]
    print pd.merge(t2, t4[['store_code', 'groups']])
    

There's no need for looping!

Comments