RageQuilt RageQuilt - 2 months ago 27
Python Question

Conditional based on slope between two rows in Pandas DataFrame

I am trying to create a program that will select rows in a large time series dataframe and return only the rows where the slope (row2-row1)/(Time2-Time1) is >= the slope of the previous two rows (row1-row0)/(Time1-Time0). I have gone through some very heavy loop operations only to meet the extent for loops.

Time A B C
1.731806 NA NA 6.715383
1.794554 NA 6.899219 6.952729
1.79434 6.689599 6.903747 6.860664
1.928432 6.962717 7.126891 7.115988
2.000463 7.226573 7.330078 7.362645
2.090469 7.444833 7.632159 7.575328
2.175347 7.709757 7.836567 7.663173
2.765631 8.904291 8.970686 8.894602
2.905104 9.018817 8.997333 8.97221
2.980718 8.796641 8.793764 8.637462
3.058576 9.050172 9.118444 9.174402


I am hoping there is a simple vectorized approach to this but I would also like to take this a step further, where I would like to return values where the slope (row3-row2)/(Time3-Time2) >= the average of the two previous slopes between rows ([(row2-row1)/(Time2-Time1)]+[row1-row0)/(Time1-Time0)])/2 OR just (row2-row0)/(Time2-Time0).

The data frame is also very large so I am hoping to iterate this function across many columns (A, B, C...Z), which is why I'm hoping for a vectorized approach, I don't care all that much about the speed of the whole operation.

I would really appreciate help to any part of the question. It is to be used to to omit data for microbial growth that is entering the stationary phase (when the slope dies down) in very large data sets. FYI data points that are NA exist because of a row conditional I created to filter out the lag phase of microbial growth.

Thank you very very much!!!

Answer

Let's start by calculating the dataframe of slopes:

slopes = df.iloc[:, 1:].diff().div(df.Time.diff(), axis=0)

This uses diff or difference of every row with the one prior and divides each column that isn't the first by the first.

And we might as well calculate the three period slopes while we're at it.

three_period_slopes = df.iloc[:, 1:].diff(2).div(df.Time.diff(2), axis=0)

Now, you mentioned doing this for each column and that you wanted it vectorized. So, were going to calculate the dataframe of truth values that you can use later to slice.

slope >= the slope of the previous two rows

gt_prior_slope = slopes >= slopes.shift()

slope >= the average of the two previous slopes

gt_prior_2_slope_avg = slopes >= slopes.rolling(2).mean().shift()

slope >= the prior three period slope

gt_prior_3_period_slope = slopes >= three_period_slopes.shift()

We use them like this:

Column 'C' slopes >= previous slope

df.C[gt_prior_slope.C]

2     6.860664
4     7.362645
7     8.894602
10    9.174402
Name: C, dtype: float64