user278411 user278411 - 2 months ago 12
R Question

Using sapply on a vector of dates: Function very slow. Why?

I have a very simple function that takes a POSIXct date, extracts the year, and subtracts 1 if the date is before June 1.

library(lubridate)
DetermineWaterYear <- function(date,
return.interval=FALSE){
wy <- year(date) + ifelse(month(date)>=6, 0, -1)
if(return.interval==FALSE){
return(wy)
} else {
interval <- interval(ymd(cat(wy),'06-01', sep=''), ymd(cat(wy+1),'05-31', sep=''))
return(interval)
}
}


When I try to use sapply() to perform this function on a vector of ~190k dates, it takes FOREVER.

sapply(temp$date, DetermineWaterYear)


Furthermore, I clocked it performing the sapply on subsets of the vector from lengths 10000 to 190000, using the following code:

tempdates <- rep(ymd('1956-01-01'), 190000)


index <- seq(10000,190000,10000)
for(i in 1:length(index)){
times[i] <- system.time(sapply(tempdates[1:index[i]], DetermineWaterYear))[3]
}


The crazy thing is, as the vector of dates gets longer, the per-record processing time increases hugely... the time required to process 190k dates is 238x the time required for 10k dates. I have plenty of memory available.

Plot of # of records vs. processing time

Why is this behaving so slowly? How can I optimize it?

Answer

As has been pointed out in the comments, passing the vector of dates directly to the function is way faster. Additionally, ifelse has a ton of overhead, so substituting ifelse(month(date)>=6, 0, -1) with floor((x/5.6) - (x^2)*0.001) - 1L will be much faster.

DetermineWaterYearNew <- function(date, return.interval=FALSE){
    x <- month(date)
    wy <- year(date) + floor((x/5.6) - (x^2)*0.001) - 1L
    if(return.interval==FALSE){
        return(wy)
    } else {
        interval <- interval(ymd(cat(wy),'06-01', sep=''), ymd(cat(wy+1),'05-31', sep=''))
        return(interval)
    }
}

Here are some benchmarks:

microbenchmark(NewVectorized=DetermineWaterYearNew(tempdates[1:1000]),
               OldVectorized=DetermineWaterYear(tempdates[1:1000]),
               NonVectorized=sapply(tempdates[1:1000],DetermineWaterYear))
Unit: microseconds
         expr       min         lq       mean     median         uq       max neval
NewVectorized   341.954   364.1215   418.7311   395.7300   460.7955   602.627   100
OldVectorized   417.077   437.3970   496.0585   462.8485   545.1555   802.954   100
NonVectorized 42601.719 45148.3070 46452.6843 45902.4100 47341.2415 62898.476   100

Only comparing the vectorized solutions on the full gamut of dates we have:

microbenchmark(NewVectorized=DetermineWaterYearNew(tempdates[1:190000]),
               OldVectorized=DetermineWaterYear(tempdates[1:190000]))
Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
NewVectorized 26.30660 27.26575 28.97715 27.84169 29.19391 102.1697   100
OldVectorized 38.98637 40.78153 44.07461 42.55287 43.77947 114.9616   100