AF7 AF7 - 1 year ago 59
R Question

Fast rolling mean + summarize

In R, I am trying to do a very fast rolling mean of a large vector (up to 400k elements) using different window widths, then for each window width summarize the data by the maximum of each year. The example below will hopefully be clear.
I have tried several approaches, and the fastest up to now seems to be using

from the package
for the running mean, and
for picking the maximum.
Please note that memory requirement is a concern: the version below requires very little memory since it does one single rolling mean and aggregation at a time; this is preferred.

#Example data frame of 10k measurements from 2001 to 2014
n <- 100000
df <- data.frame(rawdata=rnorm(n),
year=sort(sample(2001:2014, size=n, replace=TRUE))

ww <- 1:120 #Vector of window widths

dfsumm <-, ncol=121))
dfsumm[,1] <- 2001:2014
colnames(dfsumm) <- c("year", paste0("D=", ww))

system.time(for (i in 1:length(ww)) {
#Do the rolling mean for this ww
df$tmp <- roll_mean(df$rawdata, ww[i], na.rm=TRUE, fill=NA)
#Aggregate maxima for each year
dfsumm[,i+1] <- aggregate(data=df, tmp ~ year, max)[,2]
}) #28s on my machine

This gives the desired output: a
with 15 rows (years from 2001 to 2015) and 120 columns (the window widths) containing the maximum for each ww and for each year.

However, it still takes too long to compute (as I have to compute thousands of these). I have tried playing around with other options, namely
, but I've been unable to find something faster due to my lack of knowledge of those packages.

Which would be the fastest way to do this, using a single core (the code is already parallelized elsewhere)?

Answer Source

Memory management, i.e. allocation and copies, is killing you with your approach.

Here is a data.table approach, which assigns by reference:

alloc.col(df, 200) #allocate sufficient columns

#assign rolling means in a loop
for (i in seq_along(ww)) 
  set(df, j = paste0("D", i),  value = roll_mean(df[["rawdata"]], 
                                        ww[i], na.rm=TRUE, fill=NA))

dfsumm <- df[, lapply(.SD, max, na.rm = TRUE), by = year] #aggregate