DianaLog DianaLog - 2 months ago 11
R Question

Separately aggregate() cyclic condition sequencees

A data frame contains repetitive sequences of several conditions, for each repetition of condition sequence in column 'condition' I am trying to take the max of corresponding values in neighbouring column 'time_ms':

The data

> data.frame( time_ms=sample(20:40,12), condition=replicate(1,c('cond_one','cond_one','cond2','cond2','cond_three','cond_three')) )
time_ms condition
1 22 cond_one
2 38 cond_one
3 26 cond2
4 23 cond2
5 35 cond_three
6 39 cond_three
7 32 cond_one
8 27 cond_one
9 30 cond2
10 34 cond2
11 20 cond_three
12 37 cond_three


The desired outcome should be ideally containing only the max value per sequence once (unique values only to make further calculations easier):

time_ms condition max_time
1 22 cond_one
2 38 cond_one 38
3 26 cond2 26
4 23 cond2
5 35 cond_three
6 39 cond_three 39
7 32 cond_one 32
8 27 cond_one
9 30 cond2
10 34 cond2 34
11 20 cond_three
12 37 cond_three 37


A simple
aggregate()
is close:

> aggregate(.~condition, data=mydata, max)
condition time_ms
1 cond_one 38
2 cond_three 31
3 cond2 39


but how to expand this further to treat each condition instance separately?

The example is simplified, real data has limitations:


  • condition order in data can be random

  • condition sequence size is not fixed


Answer

We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(mydata)), grouped by the run-length-id of 'condition', we get the row index (.I) of the maximum 'time_ms', extract it as a column ($V1), use that as 'i', and assign (:=) the corresponding value of 'time_ms' to create the new column 'max_time'

library(data.table)
i1 <- setDT(mydata)[, .I[which.max(time_ms)],  by = rleid(condition)]$V1
mydata[i1, max_time := time_ms]
#    time_ms  condition max_time
# 1:      22   cond_one       NA
# 2:      38   cond_one       38
# 3:      26      cond2       26
# 4:      23      cond2       NA
# 5:      35 cond_three       NA
# 6:      39 cond_three       39
# 7:      32   cond_one       32
# 8:      27   cond_one       NA
# 9:      30      cond2       NA
#10:      34      cond2       34
#11:      20 cond_three       NA
#12:      37 cond_three       37

NOTE: This will create NA for the missing values instead of blanks (""). By changing it to "", it would also change the class of the column to character. So, it is better to keep it that way.

Comments