David Nehme David Nehme - 3 months ago 12
Python Question

Computing differences accross rows with multiple index columns

I have a dataframe with one column representing time, and additional columns representing other parts of the key.

df = pd.DataFrame(data=[(t, l1, l2, t * t * (1 + l2 + l1))
for t in range(3)
for l1 in [3, 4]
for l2 in [10, 100]],
columns=['t', 'l1', 'l2', 'x'])

t l1 l2 x
0 0 3 10 0
1 0 3 100 0
2 0 4 10 0
3 0 4 100 0
4 1 3 10 14
5 1 3 100 104
6 1 4 10 15
7 1 4 100 105
8 2 3 10 56
9 2 3 100 416
10 2 4 10 60
11 2 4 100 420


I'm looking for the difference in the 'x' column for the row with the previous value of 't', but the same values for 'l1', and 'l2'.

t l1 l2 x t.1 delta_x
0 0 3 10 0 1 NaN
1 0 3 100 0 1 NaN
2 0 4 10 0 1 NaN
3 0 4 100 0 1 NaN
4 1 3 10 14 2 14.0
5 1 3 100 104 2 104.0
6 1 4 10 15 2 15.0
7 1 4 100 105 2 105.0
8 2 3 10 56 3 42.0
9 2 3 100 416 3 312.0
10 2 4 10 60 3 45.0
11 2 4 100 420 3 315.0


I can generate this frame with the following code.

df['t.1'] = df.t + 1
df['delta_x'] = df.x - df.merge(df, left_on=['t', 'l1', 'l2'],
right_on=['t.1', 'l1', 'l2'],
how='left',
suffixes=['','.1'])['x.1']


Is there a cleaner or more efficient way to do this?

Answer

You must use groupby on l1 and l2 columns as you want to compare the difference of x column for the pair of these values(l1, l2) depending on the change in value of t column.

By default, diff computes the difference between the value of (t=1) and (t=0) grouped by l1 & l2 and returns the result. So, if you want to find the difference in x values between (t=2) and (t=0), you just need to do diff(periods=2).

And, finally use the tranform method to return the computed diffs within each group of the group chunk.

In [3]: df['delta_x'] = df.groupby(['l1', 'l2'])['x'].transform(lambda x: x.diff())

In [4]: df
Out[4]: 
    t  l1   l2    x  delta_x
0   0   3   10    0      NaN
1   0   3  100    0      NaN
2   0   4   10    0      NaN
3   0   4  100    0      NaN
4   1   3   10   14     14.0
5   1   3  100  104    104.0
6   1   4   10   15     15.0
7   1   4  100  105    105.0
8   2   3   10   28     14.0
9   2   3  100  208    104.0
10  2   4   10   30     15.0
11  2   4  100  210    105.0

Timing Constraints:

In [5]: %timeit df['delta_x'] = df.groupby(['l1', 'l2'])['x'].transform(lambda x: x.diff())
1000 loops, best of 3: 1.55 ms per loop

In [17]: %timeit df['delta_x'] = df.x - df.merge(df, left_on=['t', 'l1', 'l2'], right_on=['t.1', 'l1', 'l2'],how='left',suffixes=['','.1'])['x.1']
100 loops, best of 3: 3.33 ms per loop
Comments