DianaLog - 2 months ago 11
R Question

# Separately aggregate() cyclic condition sequencees

A data frame contains repetitive sequences of several conditions, for each repetition of condition sequence in column 'condition' I am trying to take the max of corresponding values in neighbouring column 'time_ms':

The data

``````> data.frame( time_ms=sample(20:40,12), condition=replicate(1,c('cond_one','cond_one','cond2','cond2','cond_three','cond_three')) )
time_ms  condition
1       22   cond_one
2       38   cond_one
3       26      cond2
4       23      cond2
5       35 cond_three
6       39 cond_three
7       32   cond_one
8       27   cond_one
9       30      cond2
10      34      cond2
11      20 cond_three
12      37 cond_three
``````

The desired outcome should be ideally containing only the max value per sequence once (unique values only to make further calculations easier):

``````   time_ms  condition   max_time
1       22   cond_one
2       38   cond_one   38
3       26      cond2   26
4       23      cond2
5       35 cond_three
6       39 cond_three   39
7       32   cond_one   32
8       27   cond_one
9       30      cond2
10      34      cond2   34
11      20 cond_three
12      37 cond_three   37
``````

A simple
`aggregate()`
is close:

``````> aggregate(.~condition, data=mydata, max)
condition time_ms
1   cond_one      38
2 cond_three      31
3      cond2      39
``````

but how to expand this further to treat each condition instance separately?

The example is simplified, real data has limitations:

• condition order in data can be random

• condition sequence size is not fixed

Answer

We can do this with `data.table`. Convert the 'data.frame' to 'data.table' (`setDT(mydata)`), grouped by the run-length-id of 'condition', we get the row index (`.I`) of the maximum 'time_ms', extract it as a column (`\$V1`), use that as 'i', and assign (`:=`) the corresponding value of 'time_ms' to create the new column 'max_time'

``````library(data.table)
i1 <- setDT(mydata)[, .I[which.max(time_ms)],  by = rleid(condition)]\$V1
mydata[i1, max_time := time_ms]
#    time_ms  condition max_time
# 1:      22   cond_one       NA
# 2:      38   cond_one       38
# 3:      26      cond2       26
# 4:      23      cond2       NA
# 5:      35 cond_three       NA
# 6:      39 cond_three       39
# 7:      32   cond_one       32
# 8:      27   cond_one       NA
# 9:      30      cond2       NA
#10:      34      cond2       34
#11:      20 cond_three       NA
#12:      37 cond_three       37
``````

NOTE: This will create NA for the missing values instead of blanks (`""`). By changing it to `""`, it would also change the `class` of the column to `character`. So, it is better to keep it that way.

Comments