RBasti - 1 month ago
R Question

Improve slow if else loop in R

I wrote some very simple R code, but it takes almost 2 hours to run on data with more than 2,000,000 rows.

Is there any way to speed up the code? I would prefer the simplest possible solution.

My R skills are okay (less than 1 year of experience), but I have reached my limit in this case. Furthermore, I have read some articles about speeding up if-else loops, but I am not sure which strategy is most suitable for my code (e.g. vectorisation, ifelse, parallelism, etc.).

Thanks for your help.

system.time(
  for (i in 1:(length(mydata$session_id) - 1)) {
    if (mydata$session_id[i] != mydata$session_id[i + 1]) {
      mydata$Einstiegskanal[i] <- "1"
    } else {
      mydata$Einstiegskanal[i] <- "0"
    }
  }
)

# 6877.1 seconds = 1.91 h

Answer

It appears that all you're doing is checking whether the id changes from one row to the next. diff was made for this.

session_id <- sample(1:10, size = 2000000, replace = TRUE)

system.time({
  # "1" where the next id differs, as in the loop from the question
  ifelse(c(diff(session_id) != 0, NA), "1", "0")
})
   user  system elapsed 
   0.64    0.05    0.69
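
To see how this lines up with the loop in the question, here is a minimal sketch on a hypothetical six-element vector (`ids`, `flag_loop` and `flag_vec` are illustrative names, not from the question). `diff` returns each element minus its predecessor, so a non-zero value marks a row whose successor has a different id. The comparison is `!= 0` because the loop writes "1" where the ids differ, and the trailing NA stands in for the last row, which the loop never touches.

```r
# Hypothetical toy input; the real data has about 2,000,000 rows.
ids <- c(3, 3, 7, 7, 7, 2)

# Reference: the loop from the question, written against a plain vector.
flag_loop <- rep(NA_character_, length(ids))
for (i in 1:(length(ids) - 1)) {
  flag_loop[i] <- if (ids[i] != ids[i + 1]) "1" else "0"
}

# Vectorised: diff(ids) has length(ids) - 1 elements, so pad with NA.
flag_vec <- ifelse(c(diff(ids) != 0, NA), "1", "0")

print(diff(ids))
print(identical(flag_loop, flag_vec))
```

If the last row should get a definite value instead of NA, it has to be assigned separately, since there is no following row to compare it with.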

If you really want to speed it up, you can try avoiding the ifelse as well.

Your code would be

lgl <- c(diff(mydata$session_id) == 0, NA)

mydata$Einstiegskanal[!lgl] <- "1"
mydata$Einstiegskanal[lgl] <- "0"
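
A minimal sketch of the same idea on a hypothetical five-row data frame (the values are made up for illustration). It pre-fills the column with NA so the assignment does not depend on the column already existing, and it relies on R skipping NA entries in a logical index when the replacement value has length one.

```r
# Hypothetical toy data frame standing in for mydata.
mydata <- data.frame(session_id = c(1, 1, 2, 3, 3))

# TRUE where the next row has the same session_id; NA for the last row.
lgl <- c(diff(mydata$session_id) == 0, NA)

# Start from NA so the last row, which has no successor, stays NA.
mydata$Einstiegskanal <- NA_character_
mydata$Einstiegskanal[!lgl] <- "1"  # next row belongs to a different session
mydata$Einstiegskanal[lgl] <- "0"   # next row belongs to the same session

print(mydata)
```

Because each assignment touches only the selected rows, no full-length intermediate vector of "1"s and "0"s is built, which is likely where the saving over ifelse comes from.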

For a comparison of speed between the two approaches:

library(microbenchmark)
session_id <- sample(1:10, size = 2000000, replace = TRUE)

y <- vector("character", length(session_id))

microbenchmark(
  with_ifelse = ifelse(c(diff(session_id) != 0, NA), "1", "0"),
  avoid_ifelse = {
    lgl <- c(diff(session_id) == 0, NA)
    y[!lgl] <- "1"
    y[lgl] <- "0"
  },
  times = 10)

Unit: milliseconds
         expr       min        lq     mean    median        uq      max neval cld
  with_ifelse 684.69879 686.16912 710.3928 714.88029 726.61384 736.1481    10   b
 avoid_ifelse  88.75335  89.21844  98.8694  90.46677  92.03064 139.8182    10  a 
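
Before trusting a benchmark like this, it can help to confirm that both expressions compute the same thing. The following is a small sketch on a hypothetical 1,000-element sample (`changed`, `res_ifelse` and `res_index` are illustrative names). Under these assumptions the two results agree on every row except the last, where ifelse yields NA and the pre-allocated character vector keeps its initial empty string.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
session_id <- sample(1:10, size = 1000, replace = TRUE)
n <- length(session_id)

# TRUE where the next id differs; NA for the last row.
changed <- c(diff(session_id) != 0, NA)

# Approach 1: ifelse propagates the NA to the last element.
res_ifelse <- ifelse(changed, "1", "0")

# Approach 2: indexed assignment skips the NA, leaving "" in place.
res_index <- vector("character", n)
res_index[changed] <- "1"
res_index[!changed] <- "0"

print(identical(head(res_ifelse, -1), head(res_index, -1)))
print(is.na(res_ifelse[n]))
print(res_index[n] == "")
```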