AF7 - 7 months ago 34

R Question

In R, I am trying to do a very fast rolling mean of a large vector (up to 400k elements) using different window widths, then for each window width summarize the data by the maximum of each year. The example below will hopefully be clear.

I have tried several approaches, and the fastest so far seems to be `roll_mean` from the `RcppRoll` package for the rolling mean and `aggregate` for the yearly maximum.

Please note that memory requirement is a concern: the version below requires very little memory since it does one single rolling mean and aggregation at a time; this is preferred.

```
library(RcppRoll)

#Example data frame of 100k measurements from 2001 to 2014
n <- 100000
df <- data.frame(rawdata=rnorm(n),
                 year=sort(sample(2001:2014, size=n, replace=TRUE))
)
ww <- 1:120 #Vector of window widths
dfsumm <- as.data.frame(matrix(nrow=14, ncol=121))
dfsumm[,1] <- 2001:2014
colnames(dfsumm) <- c("year", paste0("D=", ww))
system.time(for (i in seq_along(ww)) {
  #Do the rolling mean for this ww
  df$tmp <- roll_mean(df$rawdata, ww[i], na.rm=TRUE, fill=NA)
  #Aggregate maxima for each year
  dfsumm[,i+1] <- aggregate(data=df, tmp ~ year, max)[,2]
}) #28s on my machine
dfsumm
```

This gives the desired output: a `data.frame` with one row per year and one column per window width.

However, it still takes too long to compute (as I have to compute thousands of these). I have tried playing around with other options, namely `dplyr` and `data.table`.

Which would be the fastest way to do this?

Answer

Memory management, i.e. allocation and copies, is killing you with your approach.
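To make "allocation and copies" concrete, here is a minimal sketch (not part of the original measurements) that compares the memory address of an object before and after a column is added. It uses `address()` from `data.table`; the names `df_demo` and `dt_demo` are made up for this illustration. With a base `data.frame`, `df$col <- value` typically hands back a new object, whereas `set()` on a `data.table` adds the column in place.

```r
library(data.table)

# Base data.frame: `$<-` builds a new list of columns on assignment
df_demo <- data.frame(x = rnorm(5))
addr_df_before <- address(df_demo)
df_demo$y <- df_demo$x * 2
addr_df_after <- address(df_demo)

# data.table: set() adds the column by reference, no new object
dt_demo <- data.table(x = rnorm(5))
addr_dt_before <- address(dt_demo)
set(dt_demo, j = "y", value = dt_demo[["x"]] * 2)
addr_dt_after <- address(dt_demo)

# Compare addresses: unchanged means the object was modified in place
identical(addr_df_before, addr_df_after)
identical(addr_dt_before, addr_dt_after)
```

Inside a loop over 120 window widths on a large table, that per-iteration overhead of rebuilding the object adds up.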

Here is a data.table approach, which assigns by reference:

```
library(data.table)
library(RcppRoll)
setDT(df)
alloc.col(df, 200) #allocate sufficient columns
#assign rolling means in a loop
for (i in seq_along(ww))
  set(df, j = paste0("D", i), value = roll_mean(df[["rawdata"]],
                                                ww[i], na.rm=TRUE, fill=NA))
#aggregate: yearly maximum of each rolling-mean column only
dfsumm <- df[, lapply(.SD, max, na.rm = TRUE), by = year,
             .SDcols = paste0("D", seq_along(ww))]
```
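The code above keeps all 120 rolling-mean columns in memory at once. Since the question says memory is a concern, the following is a hedged sketch of a variant that keeps the one-width-at-a-time structure but still assigns by reference and lets `data.table` do the grouping. It is an illustration under assumptions, not a benchmarked solution: the names `dt`, `tmp` and `res` are made up here, and the data is regenerated so the block is self-contained.

```r
library(data.table)
library(RcppRoll)

set.seed(1)
n  <- 100000
dt <- data.table(rawdata = rnorm(n),
                 year    = sort(sample(2001:2014, size = n, replace = TRUE)))
ww <- 1:120                      # vector of window widths
years <- sort(unique(dt$year))

# One row per year, one column per window width
res <- matrix(NA_real_, nrow = length(years), ncol = length(ww),
              dimnames = list(years, paste0("D=", ww)))

for (i in seq_along(ww)) {
  # Overwrite a single scratch column by reference for this width
  set(dt, j = "tmp",
      value = roll_mean(dt[["rawdata"]], ww[i], na.rm = TRUE, fill = NA))
  # Yearly maximum; keyby returns the groups sorted by year
  res[, i] <- dt[, max(tmp, na.rm = TRUE), keyby = year]$V1
}

dfsumm <- data.frame(year = years, res, check.names = FALSE)
```

Only one extra column exists at any time, so peak memory stays close to the size of the input; whether it is faster than the all-columns version would need to be timed on the real data.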