jd1338 - 3 months ago
R Question

How do I improve performance when iterating through multiple conditions in R?

Let's assume that I have a dataset with the following structure:


  • I have N products

  • I'm operating in N countries

  • I have N payment partners

  • My dataset contains N days

  • I have N different prices that customers can choose from



For example:

library(data.table)

customer.id <- c(1,2,3,4,5,6,7,8)
product <- c("product1","product2","product1","product2","product1","product2","product1","product2")
country <- c("country1","country2","country1","country2","country1","country2","country1","country2")
payment.partner <- c("pp1","pp2","pp1","pp2","pp1","pp2","pp1","pp2")
day <- c("day1","day2","day1","day2","day1","day2","day1","day2")
price <- c("price1","price2","price1","price2","price1","price2","price1","price2")

# stringsAsFactors = TRUE keeps the columns as factors, so droplevels() below works on R >= 4.0
customer.data <- data.frame(customer.id,product,country,payment.partner,day,price,stringsAsFactors=TRUE)
customer.data <- data.table(customer.data)


Suppose I want to generate an aggregate from it that, for instance, runs a forecasting algorithm for each combination. To do so, I identify the unique items for each condition and iterate over them as follows:

unique.products <- droplevels(unique(customer.data[,product]))
unique.countries <- droplevels(unique(customer.data[,country]))
unique.payment.partners <- droplevels(unique(customer.data[,payment.partner]))
unique.days <- droplevels(unique(customer.data[,day]))
unique.prices <- droplevels(unique(customer.data[,price]))

for(i in seq_along(unique.products)){
  temp.data1 <- customer.data[product==unique.products[[i]]]
  for(j in seq_along(unique.countries)){
    temp.data2 <- temp.data1[country==unique.countries[[j]]]
    for(k in seq_along(unique.payment.partners)){
      temp.data3 <- temp.data2[payment.partner==unique.payment.partners[[k]]]
      for(l in seq_along(unique.days)){
        temp.data4 <- temp.data3[day==unique.days[[l]]]
        for(m in seq_along(unique.prices)){
          temp.data5 <- temp.data4[price==unique.prices[[m]]]
          if(nrow(temp.data5)!=0){
            # do your calculations here
            print(temp.data5)
          }
        }
      }
    }
  }
}


In general, this code structure works fine, but it gets really annoying when applied to real data with 5 million rows. I guess R is not the best language in terms of speed and performance. Of course, I have used multicore processing in the past, or tried to get such an aggregate straight out of Hive or a MySQL data warehouse. Using another language like C++ or Python is also always an option.
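For the multicore route, a minimal sketch using the base parallel package could look as follows. Note that forecast_fn is a hypothetical placeholder for the actual per-group calculation, mc.cores = 2 is arbitrary, and mclapply relies on forking, so it runs sequentially on Windows:

library(parallel)

# Hypothetical placeholder for the real forecasting algorithm
forecast_fn <- function(dt) nrow(dt)

# Split once into a list of non-empty per-combination subsets,
# then process the groups in parallel
groups <- split(customer.data,
                interaction(customer.data$product, customer.data$country,
                            customer.data$payment.partner, customer.data$day,
                            customer.data$price, drop = TRUE))
results <- mclapply(groups, forecast_fn, mc.cores = 2)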

However, sometimes none of these options is available, which always leads me back to that exact same processing structure. So I have been wondering for quite a while whether there is a better, faster solution from a rather architectural point of view, since it is well known (and also becomes VERY clear when benchmarking) that for loops and frequent data subselection are very, very slow.
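To make that benchmarking claim concrete, here is a minimal sketch comparing the loop pattern against a single grouped pass. It assumes the microbenchmark package is installed (system.time works just as well), and nested_loop is a two-level stand-in for the full loop above:

library(microbenchmark)

# Two-level stand-in for the nested loop above, trimmed for brevity
nested_loop <- function(dt) {
  for (p in unique(dt$product))
    for (co in unique(dt$country))
      invisible(dt[product == p & country == co])
}

microbenchmark(
  loop    = nested_loop(customer.data),
  grouped = customer.data[, .N, by = .(product, country)],
  times   = 10
)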

Grateful for all comments, hints and possible solutions!

Answer

You should read the documentation of packages you are using. Package data.table offers some excellent introductory tutorials.

customer.data <- data.frame(customer.id, product, country, payment.partner, day, price)
library(data.table)
setDT(customer.data)  # convert in place, without copying
customer.data[,
              print(customer.data[.I]),  # .I holds the row numbers of the current group;
                                         # don't do this, just refer to the columns you want to work on
              by = .(product, country, payment.partner, day, price)]

Of course, you generally wouldn't print the data.table subset here, but would work directly on specific columns.
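For instance, a minimal sketch of what that can look like; forecast_fn is again a hypothetical placeholder for the actual per-group calculation:

# Hypothetical placeholder returning one value per group
forecast_fn <- function(ids) length(ids)

result <- customer.data[,
                        .(n.rows = .N, forecast = forecast_fn(customer.id)),
                        by = .(product, country, payment.partner, day, price)]
result

Note that grouping only ever visits combinations that actually occur in the data, so the nrow(temp.data5) != 0 check from the question disappears entirely.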
