user2320577 user2320577 - 1 month ago 6
Python Question

Conditional backfill of Pandas columns

I have the following dataframe:

DATE ID STATUS
0 2014-01-01 1 INPROGRESS
1 2013-03-01 1 ENDED
2 2015-05-01 2 INPROGRESS
3 2012-05-01 1 STARTED
4 2011-05-01 2 STARTED
5 2011-03-01 3 STARTED
6 2011-04-01 3 ENDED
7 2011-06-01 3 INPROGRESS
8 2011-09-01 3 STARTED


here the code to build it:

>>> df1 = pd.DataFrame(columns=["DATE", "ID", "STATUS"])
>>> df1["DATE"] = ['2014-01-01', '2013-03-01', '2015-05-01', '2012-05-01', '2011-05-01', '2011-03-01', '2011-04-01', '2011-06-01', '2011-09-01']
>>> df1["ID"] = [1,1,2,1,2,3,3,3,3]
>>> df1["STATUS"] = ['INPROGRESS', 'ENDED', 'INPROGRESS', 'STARTED', 'STARTED', 'STARTED','ENDED', 'INPROGRESS', 'STARTED']


for each ID group the status column represents a task that can be:

STARTED, INPROGRESS or ENDED

in this precise time order (STARTED should not came after ENDED etc..).

By grouping by ID and sorting by date I get for ID 3:

df1.sort_values('DATE')[df1['ID']==3]

DATE ID STATUS
5 2011-03-01 3 STARTED
6 2011-04-01 3 ENDED
7 2011-06-01 3 INPROGRESS
8 2011-09-01 3 STARTED


No I would need to "FIX" the status column to follow the order defined above based on the last status. For the ID 3 the last status is started so everything should be backfilled to the started status as follow:

DATE ID STATUS
5 2011-03-01 3 STARTED
6 2011-04-01 3 STARTED
7 2011-06-01 3 STARTED
8 2011-09-01 3 STARTED


For ID 1:

df1.sort_values('DATE')[df1['ID']==1]
DATE ID STATUS
3 2012-05-01 1 STARTED
1 2013-03-01 1 ENDED
0 2014-01-01 1 INPROGRESS


I would end up having the last two status INPROGRESS and leave the first as STARTED like:

df1.sort_values('DATE')[df1['ID']==1]
DATE ID STATUS
3 2012-05-01 1 STARTED
1 2013-03-01 1 INPROGRESS
0 2014-01-01 1 INPROGRESS


ID 2 has the correct order.

Any idea how can I do that with pandas ?
I am trying by grouping by ID and I am thinking about backfill based on the last status but I don't know how I could stop backfilling at the right moment.

thanks!

Answer

An classic way is to forget your statuses are labels: instead see them as strictly increasing numbers, like started 1, in progress 2 and ended 3. With such a column you can now check the monotony of these numbers per group, then backfill until you see a disruption in monotony.

Prepare your dataframe:

keymapping = {'STARTED':0, 'INPROGRESS':1, 'ENDED':2}
df['STATUS_ID'] = df.STATUS.map(keymapping)
df.set_index(['ID', 'DATE'], inplace=True)
df.sort_index(inplace=True)

Now, group by ID and use transform to get the last value of each group spread across the entire index, so that you can assign it to your dataframe as a new column:

df['STATUS_LAST'] = df.groupby(level=0, as_index=False).STATUS_ID.transform('last')

df
Out[63]: 
                   STATUS  STATUS_ID  STATUS_LAST
ID DATE                                          
1  2012-05-01     STARTED          0            1
   2013-03-01       ENDED          2            1
   2014-01-01  INPROGRESS          1            1
2  2011-05-01     STARTED          0            1
   2015-05-01  INPROGRESS          1            1
3  2011-03-01     STARTED          0            0
   2011-04-01       ENDED          2            0
   2011-06-01  INPROGRESS          1            0
   2011-09-01     STARTED          0            0

Finally, apply the backfilling by using the increasing monotony of STATUS_ID against last, i.e. each value of STATUS_ID is valid when if is lower than or equal to STATUS_LAST:

df.STATUS_ID = df.STATUS_ID.where(df.STATUS_ID <= df.STATUS_LAST, df.STATUS_LAST)
df.STATUS_ID
Out[65]: 
ID  DATE      
1   2012-05-01    0
    2013-03-01    1
    2014-01-01    1
2   2011-05-01    0
    2015-05-01    1
3   2011-03-01    0
    2011-04-01    0
    2011-06-01    0
    2011-09-01    0

Reverse map it to the labels and assign it to STATUS:

df.STATUS_ID.map({v:k for k,v in keymapping.items()})
Out[66]: 
ID  DATE      
1   2012-05-01       STARTED
    2013-03-01    INPROGRESS
    2014-01-01    INPROGRESS
2   2011-05-01       STARTED
    2015-05-01    INPROGRESS
3   2011-03-01       STARTED
    2011-04-01       STARTED
    2011-06-01       STARTED
    2011-09-01       STARTED
Name: STATUS_ID, dtype: object