user278411 - 1 year ago 111
R Question

# Using sapply on a vector of dates: Function very slow. Why?

I have a very simple function that takes a POSIXct date, extracts the year, and subtracts 1 if the date is before June 1.

``````library(lubridate)
DetermineWaterYear <- function(date,
return.interval=FALSE){
wy <- year(date) + ifelse(month(date)>=6, 0, -1)
if(return.interval==FALSE){
return(wy)
} else {
interval <- interval(ymd(cat(wy),'06-01', sep=''), ymd(cat(wy+1),'05-31', sep=''))
return(interval)
}
}
``````

When I try to use sapply() to perform this function on a vector of ~190k dates, it takes FOREVER.

``````sapply(temp\$date, DetermineWaterYear)
``````

Furthermore, I clocked it performing the sapply on subsets of the vector from lengths 10000 to 190000, using the following code:

``````tempdates <- rep(ymd('1956-01-01'), 190000)

index <- seq(10000,190000,10000)
for(i in 1:length(index)){
times[i] <- system.time(sapply(tempdates[1:index[i]], DetermineWaterYear))[3]
}
``````

The crazy thing is, as the vector of dates gets longer, the per-record processing time increases hugely... the time required to process 190k dates is 238x the time required for 10k dates. I have plenty of memory available.

Why is this behaving so slowly? How can I optimize it?

As has been pointed out in the comments, passing the vector of dates directly to the function is way faster. Additionally, `ifelse` has a ton of overhead, so substituting `ifelse(month(date)>=6, 0, -1)` with `floor((x/5.6) - (x^2)*0.001) - 1L` will be much faster.

``````DetermineWaterYearNew <- function(date, return.interval=FALSE){
x <- month(date)
wy <- year(date) + floor((x/5.6) - (x^2)*0.001) - 1L
if(return.interval==FALSE){
return(wy)
} else {
interval <- interval(ymd(cat(wy),'06-01', sep=''), ymd(cat(wy+1),'05-31', sep=''))
return(interval)
}
}
``````

Here are some benchmarks:

``````microbenchmark(NewVectorized=DetermineWaterYearNew(tempdates[1:1000]),
OldVectorized=DetermineWaterYear(tempdates[1:1000]),
NonVectorized=sapply(tempdates[1:1000],DetermineWaterYear))
Unit: microseconds
expr       min         lq       mean     median         uq       max neval
NewVectorized   341.954   364.1215   418.7311   395.7300   460.7955   602.627   100
OldVectorized   417.077   437.3970   496.0585   462.8485   545.1555   802.954   100
NonVectorized 42601.719 45148.3070 46452.6843 45902.4100 47341.2415 62898.476   100
``````

Only comparing the vectorized solutions on the full gamut of dates we have:

``````microbenchmark(NewVectorized=DetermineWaterYearNew(tempdates[1:190000]),
OldVectorized=DetermineWaterYear(tempdates[1:190000]))
Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
NewVectorized 26.30660 27.26575 28.97715 27.84169 29.19391 102.1697   100
OldVectorized 38.98637 40.78153 44.07461 42.55287 43.77947 114.9616   100
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download