confused00 confused00 - 4 months ago 17
Python Question

Use Pandas dataframe to add lag feature from MultiIindex Series

I have a MultiIndex Series (3 indices) that looks like this:

Week ID_1 ID_2
3 26 1182 39.0
4767 42.0
31393 20.0
31690 42.0
32962 3.0
....................................


I also have a dataframe
df
which contains all the columns (and more) used for indices in the Series above, and I want to create a new column in my dataframe
df
that contains the value matching the
ID_1
and
ID_2
and the
Week - 2
from the Series.

For example, for the row in dataframe that has
ID_1 = 26
,
ID_2 = 1182
and
Week = 3
, I want to match the value in the Series indexed by
ID_1 = 26
,
ID_2 = 1182
and
Week = 1
(3-2) and put it on that row in a new column. Further, my Series might not necessarily have the value required by the dataframe, in which case I'd like to just have 0.

Right now, I am trying to do this by using:

[multiindex_series.get((x[1].get('week', 2) - 2, x[1].get('ID_1', 0), x[1].get('ID_2', 0))) for x in df.iterrows()]


This however is very slow and memory hungry and I was wondering what are some better ways to do this.

FWIW, the Series was created using

saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()


and I'm willing to do it a different way if better paths exist to create what I'm looking for.

Answer

Increase the Week by 2:

saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2

and then merge df with saved_groupby:

result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')

This will augment df with the target median from 2 weeks ago. To make the median (target) saved_groupby column 0 when there is no match, use fillna to change NaNs to 0:

result['Median'] = result['Median'].fillna(0)

For example,

import numpy as np
import pandas as pd
np.random.seed(2016)

df = pd.DataFrame(np.random.randint(5, size=(20,5)), 
                  columns=['Week', 'ID_1', 'ID_2', 'Target', 'Foo'])

saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target':'Median'})

result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
result['Median'] = result['Median'].fillna(0)
print(result)

yields

    Week  ID_1  ID_2  Target  Foo  Median
0      3     2     3       4    2     0.0
1      3     3     0       3    4     0.0
2      4     3     0       1    2     0.0
3      3     4     1       1    1     0.0
4      2     4     2       0    3     2.0
5      1     0     1       4    4     0.0
6      2     3     4       0    0     0.0
7      4     0     0       2    3     0.0
8      3     4     3       2    2     0.0
9      2     2     4       0    1     0.0
10     2     0     4       4    2     0.0
11     1     1     3       0    0     0.0
12     0     1     0       2    0     0.0
13     4     0     4       0    3     4.0
14     1     2     1       3    1     0.0
15     3     0     1       3    4     2.0
16     0     4     2       2    4     0.0
17     1     1     4       4    2     0.0
18     4     1     0       3    0     0.0
19     1     0     1       0    0     0.0
Comments