MarkM MarkM - 3 months ago 7
Python Question

How to group a data frame while keeping information from two rows?

I'm new to Python and I hope someone can help me with this performance issue.
My data looks like:

     TIMESTAMP                    A
34   2050-09-08 03:00:00 EST    3.0
40   2050-09-08 07:00:00 EST    3.0
67   2050-09-08 17:00:00 EST    6.0
84   2050-09-08 23:00:00 EST    6.0
89   2050-09-09 01:00:00 EST   11.0
103  2050-09-09 07:00:00 EST   10.0
110  2050-09-09 11:00:00 EST   10.0
118  2050-09-09 15:00:00 EST   10.0


I want to get the time intervals in which the values in column A are steady (S), increasing (I), or decreasing (D).

Currently I use a for-loop to compare consecutive rows and calculate the slope between their values. As long as the sign of the slope does not change from one iteration to the next, the end timestamp of the current interval is updated. This results in intervals like Interval(begin, end, state). The result for the example above would be:

Interval(2050-09-08 03:00:00 EST, 2050-09-08 07:00:00 EST, S)
Interval(2050-09-08 07:00:00 EST, 2050-09-08 17:00:00 EST, I)
Interval(2050-09-08 17:00:00 EST, 2050-09-08 23:00:00 EST, S)
etc.
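
For reference, the loop described above might look roughly like the sketch below. It is only an illustration under assumptions: the Interval namedtuple and the helper names sign_to_state and intervals_by_loop are hypothetical, and it compares the sign of the difference in A rather than the slope itself, which gives the same sign as long as the timestamps are increasing.

```python
from collections import namedtuple

# Hypothetical container for the Interval(begin, end, state) records.
Interval = namedtuple("Interval", ["begin", "end", "state"])


def sign_to_state(delta):
    """Map the change between two consecutive values to D, S or I."""
    if delta < 0:
        return "D"
    if delta == 0:
        return "S"
    return "I"


def intervals_by_loop(timestamps, values):
    """Walk over consecutive pairs and extend the current interval while the state is unchanged."""
    intervals = []
    for i in range(len(values) - 1):
        state = sign_to_state(values[i + 1] - values[i])
        if intervals and intervals[-1].state == state:
            # Same state as before: only move the end of the current interval.
            intervals[-1] = intervals[-1]._replace(end=timestamps[i + 1])
        else:
            intervals.append(Interval(timestamps[i], timestamps[i + 1], state))
    return intervals


timestamps = [
    "2050-09-08 03:00:00 EST", "2050-09-08 07:00:00 EST",
    "2050-09-08 17:00:00 EST", "2050-09-08 23:00:00 EST",
    "2050-09-09 01:00:00 EST", "2050-09-09 07:00:00 EST",
    "2050-09-09 11:00:00 EST", "2050-09-09 15:00:00 EST",
]
values = [3.0, 3.0, 6.0, 6.0, 11.0, 10.0, 10.0, 10.0]

loop_result = intervals_by_loop(timestamps, values)
for interval in loop_result:
    print(interval)
```

This is the baseline that a vectorised version would need to reproduce.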


Since the dataset contains many rows and columns, I'm trying to find a way to code this more efficiently (without a for-loop).

import pandas as pd

data['slope'] = compute_slopes(data)
data['state'] = data['slope'].apply(get_state)  # store the state computed from each slope
data["shift"] = data["state"].shift(1)
data["check"] = data["state"] != data["shift"]
data["group"] = data["check"].cumsum()
begin_group = data.groupby("group").first()
end_group = data.groupby("group").last()
result = pd.concat([begin_group, end_group])
result = result.sort_values('TIMESTAMP')

def compute_slopes(data):
    # Slope between each row and the row that follows it.
    next_df = data.shift(-1)
    fmt = '%Y-%m-%d %H:%M:%S EST'
    return get_slope(pd.to_datetime(data['TIMESTAMP'], format=fmt), data['A'],
                     pd.to_datetime(next_df['TIMESTAMP'], format=fmt), next_df['A'])


def get_slope(x1, y1, x2, y2):
    # Change in value per minute.
    return (y2 - y1) / ((x2 - x1).dt.total_seconds() / 60)


def get_state(slope):
    if slope < 0:
        state = 'D'  # DECREASING
    elif slope == 0:
        state = 'S'  # STEADY
    else:
        state = 'I'  # INCREASING

    return state


The code above produces something like the table below, but grouping this data frame does not work because each state belongs to two timestamps (for example, the first state S spans 03:00:00 to 07:00:00).

     TIMESTAMP                    A       slope  state
34   2050-09-08 03:00:00 EST    3.0    0.000000  S
40   2050-09-08 07:00:00 EST    3.0    0.005000  I
67   2050-09-08 17:00:00 EST    6.0    0.000000  S
84   2050-09-08 23:00:00 EST    6.0    0.041667  I
89   2050-09-09 01:00:00 EST   11.0   -0.002778  D
103  2050-09-09 07:00:00 EST   10.0    0.000000  S
110  2050-09-09 11:00:00 EST   10.0    0.000000  S
118  2050-09-09 15:00:00 EST   10.0    0.000000  S


I want to group these states somehow, get the begin and end timestamps of each group, and save them as an interval. Does anyone know a quicker way than just looping through the data frame?

Many thanks in advance!

Answer

This should help. Use shift to compare each row with the next one, then use groupby + agg to collapse consecutive rows that share a state.

df.loc[df.A < df.A.shift(-1), 'State'] = 'I'
df.loc[df.A > df.A.shift(-1), 'State'] = 'D'
df.loc[df.A == df.A.shift(-1).ffill(), 'State'] = 'S'
df['StateGroup'] = (df.State != df.State.shift()).cumsum()
df['NextTIMESTAMP'] = df.TIMESTAMP.shift(-1).ffill()
df
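
The StateGroup line does most of the work: comparing State with its shifted copy marks every row where a new run starts, and cumsum turns those marks into a running group number. Here is a small illustration using the states that the sample data should produce (the names states, changed and group are just for this sketch):

```python
import pandas as pd

# States expected for the eight sample rows, in order.
states = pd.Series(list("SISIDSSS"), name="State")

# True wherever the state differs from the previous row (the first row always counts as a change).
changed = states != states.shift()

# Running count of changes: rows in the same consecutive run share a number.
group = changed.cumsum()

print(pd.DataFrame({"State": states, "changed": changed, "StateGroup": group}))
```

Only the last three rows share a group (6), because they are the only consecutive rows with the same state.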

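# Named aggregation (pandas >= 0.25): output_column=(input_column, function).
df.groupby('StateGroup').agg(
    A_mean=('A', 'mean'), A_count=('A', 'count'), A_first=('A', 'first'), A_last=('A', 'last'),
    State=('State', 'first'), Start=('TIMESTAMP', 'first'), End=('NextTIMESTAMP', 'last'))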
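
To get all the way to the Interval(begin, end, state) records the question asks for, something along these lines might work. Treat it as a sketch, not a definitive implementation: the Interval namedtuple and the names edges, run and merged are made up for illustration, and it labels each pair of consecutive rows rather than each row, so the last row needs no special handling.

```python
from collections import namedtuple

import numpy as np
import pandas as pd

# Hypothetical container mirroring the Interval(begin, end, state) from the question.
Interval = namedtuple("Interval", ["begin", "end", "state"])

fmt = "%Y-%m-%d %H:%M:%S EST"
df = pd.DataFrame(
    {
        "TIMESTAMP": pd.to_datetime(
            [
                "2050-09-08 03:00:00 EST", "2050-09-08 07:00:00 EST",
                "2050-09-08 17:00:00 EST", "2050-09-08 23:00:00 EST",
                "2050-09-09 01:00:00 EST", "2050-09-09 07:00:00 EST",
                "2050-09-09 11:00:00 EST", "2050-09-09 15:00:00 EST",
            ],
            format=fmt,
        ),
        "A": [3.0, 3.0, 6.0, 6.0, 11.0, 10.0, 10.0, 10.0],
    },
    index=[34, 40, 67, 84, 89, 103, 110, 118],
)

# One record per pair of consecutive rows: the state describes the step
# from "begin" to "end", so it is tied to both timestamps at once.
delta = np.diff(df["A"].to_numpy())
edges = pd.DataFrame(
    {
        "begin": df["TIMESTAMP"].to_numpy()[:-1],
        "end": df["TIMESTAMP"].to_numpy()[1:],
        "state": np.select([delta < 0, delta == 0], ["D", "S"], default="I"),
    }
)

# Consecutive steps with the same state share a run id.
run = (edges["state"] != edges["state"].shift()).cumsum().rename("run")

# Collapse each run to its first begin, last end and (constant) state.
merged = edges.groupby(run).agg(
    begin=("begin", "first"), end=("end", "last"), state=("state", "first")
)
intervals = [Interval(*row) for row in merged.itertuples(index=False)]

for interval in intervals:
    print(interval)
```

On the sample data this should give six intervals, starting with S from 03:00 to 07:00 and I from 07:00 to 17:00 on 2050-09-08, which matches the output listed in the question.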
