Energy1937 Energy1937 - 4 months ago 9
R Question

How to speed up a loop-like function in R

To avoid using a for loop in R, I wrote a function that returns an average value from one data frame given row-specific values from another data frame. I then pass this function to sapply over the range of row numbers. My function works, but it returns ~ 2.5 results per second, which is not much better than using a for loop. So I feel like I have not fully exploited the vectorized aspects of the apply family of functions. Can anyone help me rethink my approach? Here is a minimal working example. Thanks in advance.

#Creating first dataframe
dates<-seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1)
n<-length(dates)
df1<-data.frame(date = dates,
hour = sample(1:24, n, replace = TRUE),
cat = sample(c("a", "b"), n, replace = TRUE),
lag = sample(1:24, n, replace = TRUE))

#Creating second dataframe
df2<-data.frame(date = sort(rep(dates, 24)),
hour = rep(1:24, length(dates)),
p = runif(length(rep(dates, 24)), min = -20, max = 100))

df2<-df2[order(df2$date, df2$hour),]

df2$cat<-"a"
temp<-df2
temp$cat<-"b"
df2<-rbind(df2,temp)

#function
period_mean<-function(x){

tmp<-df2[df2$cat == df1[x,]$cat,]

#This line finds the row position in tmp at which
#the two dataframes match on date and hour
he_i<-which(tmp$date == df1[x,]$date & tmp$hour == df1[x,]$hour)

#My lagged period is given by the variable "lag". I want the average
#of p over the hours from (hour - lag) through hour. Since df2 is sorted
#so that hours are consecutive, this method requires that I subset on
#only the relevant value of cat (hence the creation of tmp in the first
#line of the function).
p<-mean(tmp[(he_i - df1[x,]$lag):he_i,]$p)

print(x)
print(p)
return(p)
}

#Execute function
out<-sapply(1:length(row.names(df1)), period_mean)

Answer

Here's one suggestion:

getIdx <- function(i) {
    date <- df1$date[i]
    hour <- df1$hour[i]    
    cat <- df1$cat[i]
    which(df2$date==date & df2$hour==hour & df2$cat==cat)
}
v_getIdx <- Vectorize(getIdx)

df1$index <- v_getIdx(1:nrow(df1))
b_start <- match("b", df2$cat)
out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
    flr <- ifelse(x[1]=="a", 1, b_start)
    x <- as.numeric(x[2:3])
    mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
})

We make a function (getIdx) to retrieve the rows from df2 that match the values from each row in df1, and then Vectorize the function.
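
As a side note, Vectorize is a convenience wrapper around mapply, so it still calls the scalar function once per element. A minimal sketch of that behaviour, using a hypothetical toy table (lookup, wanted and getRow are illustrative names, not part of the question's data):

```r
# Toy lookup table standing in for df2 (hypothetical data).
lookup <- data.frame(hour = 1:5, p = c(10, 20, 30, 40, 50))
wanted <- c(4, 2, 5)

# Scalar helper: row of `lookup` whose hour equals wanted[i].
# It only gives the intended answer for a single i at a time.
getRow <- function(i) which(lookup$hour == wanted[i])

# Vectorize() wraps the helper in mapply(), so it accepts a vector of i.
v_getRow <- Vectorize(getRow)
rows_vec <- v_getRow(seq_along(wanted))

# The same result from an explicit sapply() call.
rows_sapply <- sapply(seq_along(wanted), getRow)
```

Both calls loop over the elements internally, so the speed-up in this answer comes mainly from avoiding the repeated subsetting of df2, not from Vectorize itself.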

We then run the vectorized function to get a vector of row indices into df2, one per row of df1. We set b_start to be the row where the "b" category starts.

We then iterate through the rows of df1 with apply. In the mean(...) function, we set the "floor" to be either row 1 (if cat=="a") or b_start (if cat=="b"), which eliminates the need to subset (what you were doing with tmp).
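
If that is still too slow, the same "floor" idea can be taken further by dropping the per-row iteration altogether. The following is only a sketch, not the method timed in this answer: it assumes each (date, hour, cat) combination occurs exactly once in df2 and that df2 is ordered by cat, then date, then hour, as in the question. The names idx, flr, start, cs and out3 are introduced here for illustration, and the toy data uses a shorter date range than the question.

```r
# Toy data mirroring the question's layout (shorter date range).
set.seed(1)
dates <- seq(as.Date("2013-01-01"), as.Date("2013-03-01"), by = 1)
n <- length(dates)
df1 <- data.frame(date = dates,
                  hour = sample(1:24, n, replace = TRUE),
                  cat  = sample(c("a", "b"), n, replace = TRUE),
                  lag  = sample(1:24, n, replace = TRUE))
df2 <- data.frame(date = rep(dates, each = 24),
                  hour = rep(1:24, times = n),
                  p    = runif(24 * n, min = -20, max = 100))
df2 <- rbind(transform(df2, cat = "a"), transform(df2, cat = "b"))

# Row of df2 matching each row of df1 on (date, hour, cat),
# found in one pass by matching pasted keys.
idx <- match(paste(df1$date, df1$hour, df1$cat),
             paste(df2$date, df2$hour, df2$cat))

# First row of each category block; each window is clamped to its
# own block so it never reaches back into the previous category.
b_start <- match("b", df2$cat)
flr <- ifelse(df1$cat == "a", 1L, b_start)
start <- pmax(flr, idx - df1$lag)

# Prefix sums: cs[k + 1] equals sum(df2$p[1:k]), so the sum over
# start:idx is a difference of two entries.
cs <- c(0, cumsum(df2$p))
out3 <- (cs[idx + 1] - cs[start]) / (idx - start + 1)
```

Because the means come from differences of cumulative sums, they can differ from mean() in the last few floating-point digits, so a comparison with the apply-based result should use all.equal rather than identical.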

Performance:

> system.time(out<-sapply(1:length(row.names(df1)), period_mean))
   user  system elapsed 
 11.304   0.393  11.917 

> system.time({
+     df1$index <- v_getIdx(1:nrow(df1))
+     b_start <- match("b", df2$cat)
+     out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
+         flr <- ifelse(x[1]=="a", 1, b_start)
+         x <- as.numeric(x[2:3])
+         mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
+     })
+ })
   user  system elapsed 
  2.839   0.405   3.274 

> all.equal(out, out2)
[1] TRUE