Constanze Vieth Constanze Vieth - 6 days ago 7
R Question

Big tasks In R, how to avoid for loops to run faster

My code is running but very very slowly. So this is a big problem and it has to run quicker. So here is the task:
I have a dataset with telecommunication records and i want to apply multiple functions on all records to each customer and put the results in a another data frame.

So

df1
is the data frame where each row has a unique customer id and columns with some profil infomations.
df2
is a very big data frame with about 800 000 telecommunications records identifyed over the customer ids. Now i want to compute e.g. the average data usage for each customer in
df2
and save the result in
df1
.

df1
looks like

df1 <- read.table(header = TRUE, sep=",",
text="CUSTOMER_ID,Age,ContractType, Gender
ID1,45,Postpaid,m
ID2,50,Postpaid,f
ID3,35,Postpaid,f
ID4,44,Postpaid,m
ID5,32,Postpaid,m
ID6,48,Postpaid,f
ID7,50,Postpaid,m
ID8,51,Postpaid,f")


df2
looks like

df2 <- read.table(header = TRUE, sep=",",
text="CUSTOMER_ID,EVENT,VOLUME, DURATION, MONTH
ID1,100,500,200,201505
ID1,50,400,150,201506
ID1,80,600,50,201507
ID2,40,800,45,201505
ID2,25,650,120,201506
ID2,65,380,250,201507
ID3,30,950,110,201505
ID3,25,630,85,201506
ID3,15,780,60,201507")


My codes is like



USAGE <- c("EVENT", "VOLUME", "DURATION") #column names of df2


list of functions i want to apply on
df2


StatFunctions <- list(
max = function(x) max(x),
mean = function(x) mean(x),
sum = function(x) sum(x)
)


In my original data set the Customer IDs are more complex so i choose this pattern search for the cutsomer ids. This is only a cut out of my code. But with the rest it is the same problem with the for loops.

func.num <- function(prefix, target.df, n) {
active.df <- get(target.df)
return(StatFunctions[[n]](active.df[grep(pattern = prefix,
x = active.df$CUSTOMER_ID), USAGE[m]]))
}


for (x in df1$CUSTOMER_ID) {
for (m in 1:length(USAGE)) {
for (n in 1:length(StatFunctions)) {
df1[df1$CUSTOMER_ID == x, paste(names(StatFunctions[n]),
USAGE[m], sep = "_")] <- func.num(prefix = x, target.df = "df2",n)
}
}
}


I know the code is very complicated and should be simplified.

And i want a data frame like this

Customer_ID Age contractType Gender max_EVENT mean_EVENT sum_EVENT ... sum_DURATION
ID1 45 Postpaid m 100 76 230 ... 400


So how can i avoid the for loops to run faster?


Answer

I would use dplyr package to summarize df2 by customer ID, then merge with df1.

df1 <- read.table(header = TRUE, sep=",",
                  text="CUSTOMER_ID,Age,ContractType, Gender
       ID1,45,Postpaid,m
       ID2,50,Postpaid,f
       ID3,35,Postpaid,f
       ID4,44,Postpaid,m
       ID5,32,Postpaid,m
       ID6,48,Postpaid,f
       ID7,50,Postpaid,m
       ID8,51,Postpaid,f")


df2 <- read.table(header = TRUE, sep=",",
                  text="CUSTOMER_ID,EVENT,VOLUME, DURATION, MONTH
   ID1,100,500,200,201505
   ID1,50,400,150,201506
   ID1,80,600,50,201507
   ID2,40,800,45,201505
   ID2,25,650,120,201506
   ID2,65,380,250,201507
   ID3,30,950,110,201505
   ID3,25,630,85,201506
   ID3,15,780,60,201507")

df1$CUSTOMER_ID <- gsub(" ", "", df1$CUSTOMER_ID)
df2$CUSTOMER_ID <- gsub(" ", "", df2$CUSTOMER_ID)

library(dplyr)
USAGE <- c("EVENT", "VOLUME", "DURATION")
FUNC  <- c("max", "mean", "sum")
dots <- lapply(USAGE, function(u) sprintf("%s(%s)", FUNC, u)) %>% unlist()
dots <- setNames(dots, sub("\\)", "", sub("\\(", "_", dots)))
sum_df <- df2 %>% group_by(CUSTOMER_ID) %>%
  summarize_(.dots = dots) %>% 
  ungroup()

df1$CUSTOMER_ID <- as.character(df1$CUSTOMER_ID)
sum_df$CUSTOMER_ID <- as.character(sum_df$CUSTOMER_ID)
df1 <- left_join(df1, sum_df)
Comments