Energy1937 - 9 months ago 50

R Question

In trying to avoid using a for loop in R, I wrote a function that returns an average value from one data frame given row-specific values from another data frame. I then pass this function to sapply over the range of row numbers. My function works, but it returns only ~2.5 results per second, which is not much better than using a for loop. So I feel like I've not fully exploited the vectorized aspects of the apply family of functions. Can anyone help me rethink my approach? Here is a minimal working example. Thanks in advance.

```
# Creating first dataframe
dates <- seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1)
n <- length(seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1))
df1 <- data.frame(date = dates,
                  hour = sample(1:24, n, replace = TRUE),
                  cat = sample(c("a", "b"), n, replace = TRUE),
                  lag = sample(1:24, n, replace = TRUE))

# Creating second dataframe
df2 <- data.frame(date = sort(rep(dates, 24)),
                  hour = rep(1:24, length(dates)),
                  p = runif(length(rep(dates, 24)), min = -20, max = 100))
df2 <- df2[order(df2$date, df2$hour), ]
df2$cat <- "a"
temp <- df2
temp$cat <- "b"
df2 <- rbind(df2, temp)

# Function
period_mean <- function(x) {
  # Keep only the rows of df2 whose cat matches row x of df1
  tmp <- df2[df2$cat == df1[x, ]$cat, ]
  # This line extracts the row index in tmp at which the two
  # dataframes match on date and hour
  he_i <- which(tmp$date == df1[x, ]$date & tmp$hour == df1[x, ]$hour)
  # My lagged period is given by the variable "lag". I want the average
  # over the period from (hour - lag) to hour. Since df2 is sorted such
  # that hours are consecutive, this method requires that I subset on only
  # the relevant value for cat (hence the creation of tmp in the first
  # line of the function)
  p <- mean(tmp[(he_i - df1[x, ]$lag):he_i, ]$p)
  print(x)
  print(p)
  return(p)
}

# Execute function
out <- sapply(1:length(row.names(df1)), period_mean)
```

Answer

Here's one suggestion:

```
# Row of df2 that matches row i of df1 on date, hour and cat
getIdx <- function(i) {
  date <- df1$date[i]
  hour <- df1$hour[i]
  cat <- df1$cat[i]
  which(df2$date == date & df2$hour == hour & df2$cat == cat)
}
v_getIdx <- Vectorize(getIdx)
df1$index <- v_getIdx(1:nrow(df1))

# First row of the "b" block in df2
b_start <- match("b", df2$cat)

out2 <- apply(df1[, c("cat", "lag", "index")], MARGIN = 1, function(x) {
  # Lowest row the window may reach for this category
  flr <- ifelse(x[1] == "a", 1, b_start)
  # x[1] is now lag, x[2] is index
  x <- as.numeric(x[2:3])
  mean(df2$p[max(flr, (x[2] - x[1])):x[2]])
})
```

We make a function (`getIdx`) to retrieve the rows from `df2` that match the values from each row in `df1`, and then `Vectorize` the function.
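Note that `Vectorize` wraps `mapply`, so `getIdx` still runs once per row of `df1`. As a rough sketch of an alternative (not what the answer above does), and assuming every date/hour/cat combination occurs exactly once in `df2`, the same indices can be found with a single `match` on pasted keys. The small data frames `d1` and `d2` are made-up stand-ins for `df1` and `df2`.

```r
# Toy stand-ins for df1 and df2 (hypothetical data, for illustration only)
dates <- seq(as.Date("2013-01-01"), as.Date("2013-01-03"), by = 1)
d2 <- data.frame(date = rep(rep(dates, each = 24), times = 2),
                 hour = rep(1:24, times = 6),
                 cat  = rep(c("a", "b"), each = 72),
                 stringsAsFactors = FALSE)
d1 <- data.frame(date = dates,
                 hour = c(5L, 12L, 24L),
                 cat  = c("a", "b", "a"),
                 stringsAsFactors = FALSE)

# Build one text key per row, then look up all positions in one call
key1 <- paste(d1$date, d1$hour, d1$cat)
key2 <- paste(d2$date, d2$hour, d2$cat)
idx  <- match(key1, key2)

# Row-by-row lookup, as getIdx does, for comparison
idx_loop <- sapply(seq_len(nrow(d1)), function(i) {
  which(d2$date == d1$date[i] & d2$hour == d1$hour[i] & d2$cat == d1$cat[i])
})
stopifnot(identical(idx, idx_loop))
```

If a key could appear more than once in `df2`, `match` would return only the first position, so the uniqueness assumption matters here.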

We then run the vectorized function to get a vector of row indices into `df2`. We set `b_start` to be the row where the "b" category starts.
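As a small illustration of why `match` works for this, here is a sketch on a toy vector. The name `cats` is hypothetical and only mimics the layout of `df2$cat`, where all "a" rows come before all "b" rows.

```r
# Toy category column laid out like df2$cat: all "a" rows, then all "b" rows
cats <- rep(c("a", "b"), each = 4)

# match() returns the position of the first match only
b_start <- match("b", cats)

# An equivalent formulation using which()
b_start_alt <- which(cats == "b")[1]

stopifnot(b_start == b_start_alt)
```

This relies on the "b" rows forming one contiguous block at the end, which holds for `df2` because it was built with `rbind(df2, temp)`.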

We then iterate through the rows of `df1` with `apply`. In the `mean(...)` call, we set the "floor" to be either row 1 (if `cat=="a"`) or `b_start` (if `cat=="b"`), which eliminates the need to subset (what you were doing with `tmp`).
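Because each window is a contiguous run of rows, its mean can also be computed from a cumulative sum, which would remove the per-row `apply` entirely. This is only a sketch under assumptions, not the method used above: `p` is a toy vector standing in for `df2$p`, and `hi`, `lag` and `flr` are hypothetical names for the matched index, the lag, and the category floor of each query.

```r
# Toy stand-ins (hypothetical values, for illustration only)
p   <- c(2, 4, 6, 8, 10, 12)   # plays the role of df2$p
hi  <- c(3, 2, 6)              # matched row index per query (like df1$index)
lag <- c(2, 5, 1)              # lag per query (like df1$lag)
flr <- c(1, 1, 4)              # first row of each query's category block

lo <- pmax(flr, hi - lag)      # window start, clamped to the floor
cs <- c(0, cumsum(p))          # cs[k + 1] holds sum(p[1:k])

# Mean of p[lo:hi] for every query at once
win_mean <- (cs[hi + 1] - cs[lo]) / (hi - lo + 1)

# Check against an explicit per-query computation
loop_mean <- mapply(function(l, h) mean(p[l:h]), lo, hi)
stopifnot(isTRUE(all.equal(win_mean, loop_mean)))
```

With a long `p`, differences of cumulative sums can pick up small floating-point rounding error, so comparing results with `all.equal` rather than `identical` is the safer check.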

Performance:

```
> system.time(out<-sapply(1:length(row.names(df1)), period_mean))
user system elapsed
11.304 0.393 11.917
> system.time({
+ df1$index <- v_getIdx(1:nrow(df1))
+ b_start <- match("b", df2$cat)
+ out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
+ flr <- ifelse(x[1]=="a", 1, b_start)
+ x <- as.numeric(x[2:3])
+ mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
+ })
+ })
user system elapsed
2.839 0.405 3.274
> all.equal(out, out2)
[1] TRUE
```