Pythonista anonymous Pythonista anonymous - 6 months ago 10
Python Question

Python: Running maximum by another column?

I have a dataframe like this, which tracks the value of certain items (ids) over time:

mytime=np.tile( np.arange(0,10) , 2 )
myids=np.repeat( [123,456], [10,10] )
myvalues=np.random.random_integers(20,30,10*2)

df=pd.DataFrame()
df['myids']=myids
df['mytime']=mytime
df['myvalues']=myvalues



+-------+--------+----------+--+--+
| myids | mytime | myvalues | | |
+-------+--------+----------+--+--+
| 123 | 0 | 29 | | |
+-------+--------+----------+--+--+
| 123 | 1 | 23 | | |
+-------+--------+----------+--+--+
| 123 | 2 | 26 | | |
+-------+--------+----------+--+--+
| 123 | 3 | 24 | | |
+-------+--------+----------+--+--+
| 123 | 4 | 25 | | |
+-------+--------+----------+--+--+
| 123 | 5 | 29 | | |
+-------+--------+----------+--+--+
| 123 | 6 | 28 | | |
+-------+--------+----------+--+--+
| 123 | 7 | 21 | | |
+-------+--------+----------+--+--+
| 123 | 8 | 20 | | |
+-------+--------+----------+--+--+
| 123 | 9 | 26 | | |
+-------+--------+----------+--+--+
| 456 | 0 | 26 | | |
+-------+--------+----------+--+--+
| 456 | 1 | 24 | | |
+-------+--------+----------+--+--+
| 456 | 2 | 20 | | |
+-------+--------+----------+--+--+
| 456 | 3 | 26 | | |
+-------+--------+----------+--+--+
| 456 | 4 | 29 | | |
+-------+--------+----------+--+--+
| 456 | 5 | 29 | | |
+-------+--------+----------+--+--+
| 456 | 6 | 24 | | |
+-------+--------+----------+--+--+
| 456 | 7 | 21 | | |
+-------+--------+----------+--+--+
| 456 | 8 | 27 | | |
+-------+--------+----------+--+--+
| 456 | 9 | 29 | | |
+-------+--------+----------+--+--+


I'd need to calculate the running maximum for each id.

np.maximum.accumulate()


would calculate the running maximum regardless of id, whereas I need a similar calculation, which however resets every time the id changes. I can think of a simple script to do it in numba (I have very large arrays and non-vectorised non-numba code would be slow), but is there an easier way to do it?

With just two values I can run:

df['running max']= np.hstack(( np.maximum.accumulate(df[ df['myids']==123 ]['myvalues']) , np.maximum.accumulate(df[ df['myids']==456 ]['myvalues']) ) )


but this is not feasible with lots and lots of values.

Thanks!

Answer

Here you go. Assumption is mytime is sorted.

mytime=np.tile( np.arange(0,10) , 2 )
myids=np.repeat( [123,456], [10,10] )
myvalues=np.random.random_integers(20,30,10*2)

df=pd.DataFrame()
df['myids']=myids
df['mytime']=mytime
df['myvalues']=myvalues

groups = df.groupby('myids')
df['run_max_group'] = groups['myvalues'].transform(np.maximum.accumulate)

Output...

    myids  mytime  myvalues  run_max_group
0     123       0        27             27
1     123       1        21             27
2     123       2        24             27
3     123       3        25             27
4     123       4        22             27
5     123       5        20             27
6     123       6        20             27
7     123       7        30             30
8     123       8        24             30
9     123       9        22             30
10    456       0        29             29
11    456       1        23             29
12    456       2        30             30
13    456       3        28             30
14    456       4        26             30
15    456       5        25             30
16    456       6        28             30
17    456       7        27             30
18    456       8        20             30
19    456       9        24             30
Comments