Ross Ross - 27 days ago 9
Python Question

Squash dataframe by a certain column to rows that contains first and last timestamps and mean of values

I have marked groups of data ordered by timestamps and I want to reduce it to the start timestamp and last timestamp and get the average of the values corresponding to that mark in that group. Example starting dataframe:

timestamp value mark
1 2016-11-07 11:00:00 0.781726 1
2 2016-11-07 11:03:00 0.812757 2
3 2016-11-07 11:05:00 0.845348 2
4 2016-11-07 11:07:00 0.817394 2
5 2016-11-07 11:11:00 0.760787 1
6 2016-11-07 11:13:00 0.807892 1
7 2016-11-07 11:15:00 0.812965 1
8 2016-11-07 11:18:00 0.822001 1


What I want to achieve:

start_timestamp end_timestamp (mean_)value mark
1 2016-11-07 11:00:00 2016-11-07 11:00:00 0.781726 1
2 2016-11-07 11:03:00 2016-11-07 11:07:00 0.825166 2
3 2016-11-07 11:11:00 2016-11-07 11:18:00 0.800911 1


Any idea of the best way to do this? Should I first mark each batch with a unique mark?

Answer

You need groupby by Series of unique groups from duplicated column mark and then aggregate first, last and mean:

print ((df.mark != df.mark.shift()).cumsum())
1    1
2    2
3    2
4    2
5    3
6    3
7    3
8    3
Name: mark, dtype: int32

df1 = df.groupby((df.mark != df.mark.shift()).cumsum()) \
         .agg({'timestamp': ['first','last'], 'value':'mean', 'mark': 'first'})

#reset MultiIndex in columns
df1.columns = ['_'.join(col) for col in df1.columns]
#if necessary rename columns
df1 = df1.rename(columns=({'timestamp_first':'start_timestamp', 
                           'timestamp_last':'end_timestamp',
                           'mark_first':'mark','value_mean':'(mean_)value'})) \
          .rename_axis(None)

print (df1)         
      start_timestamp       end_timestamp  mark  (mean_)value
1 2016-11-07 11:00:00 2016-11-07 11:00:00     1      0.781726
2 2016-11-07 11:03:00 2016-11-07 11:07:00     2      0.825166
3 2016-11-07 11:11:00 2016-11-07 11:18:00     1      0.800911