J.Tristan - 7 months ago
R Question

Fast computing solution to calculate number of rows matching a vector value

Given a data frame and a set of its columns, I am trying to build a new data frame that is a grid of all possible combinations of the levels of the selected columns. I also add a variable, T_weight, which counts how many times each combination appears in the original data frame. It is initialized at 0.001 rather than 0, so that combinations that never occur do not cause problems in later processing.

Here is the code I am using, with a simple example:

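Grid_from_df <- function(df, inputs, true_weights = FALSE) {
  # lapply always returns a list, even when all columns have the same number of levels
  df_levels <- lapply(df[, inputs], levels)
  grid_df <- expand.grid(df_levels)
  grid_df$F_weight <- rep(1, nrow(grid_df))
  if (true_weights) {
    grid_df$T_weight <- rep(0.001, nrow(grid_df))
    for (i in 1:nrow(grid_df)) {
      # compare every row of df with grid row i, column by column
      sel <- t(apply(df[, inputs], 1, function(row) grid_df[i, inputs] == row))
      # a row of df matches when all of its columns match
      sel2 <- apply(sel, 1, function(row) sum(row) == length(row))
      grid_df[i, "T_weight"] <- grid_df[i, "T_weight"] + sum(sel2)
    }
  }
  grid_df
}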

df2 <- data.frame(X1=c(1,2,2,1),X2=c(2,2,1,1),X3=c(3,4,5,3))
df2$X1 <- as.factor(df2$X1)
df2$X2 <- as.factor(df2$X2)
df2$X3 <- as.factor(df2$X3)
df3 <- Grid_from_df(df2, c("X1","X2","X3"), TRUE)

Even though it seems to be working on this simple example, the processing time to calculate T_weight is extremely long for the size of dataframes of my project (about 10 variables, and 300 000 rows).

Since I am rather limited in R, I was wondering if someone had a faster solution to calculate T_weight, maybe without using a for loop.

Thank you for your time


Answer

Start with a data frame in which the combination (1,2,3) appears twice:

library(data.table)
df2 <- data.frame(X1=c(1,2,2,1,1),X2=c(2,2,1,1,2),X3=c(3,4,5,3,3))
df2 <- data.table(df2)
df2$T_weigth <- 1

This collapses the groups and counts the number of times each combination appears:

df2 <- df2[, list(T_weigth=sum(T_weigth)), by=c("X1", "X2", "X3")]
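
As a side note on the grouping step above, data.table also provides the special symbol .N, which holds the number of rows in the current group, so the helper column of ones is not strictly needed. A minimal sketch on the same toy data (the names dt and counts are mine, chosen only for illustration):

```r
library(data.table)

# same toy data as above, built directly as a data.table
dt <- data.table(X1 = c(1, 2, 2, 1, 1),
                 X2 = c(2, 2, 1, 1, 2),
                 X3 = c(3, 4, 5, 3, 3))

# .N is the number of rows in each group defined by `by`
counts <- dt[, .(T_weigth = .N), by = c("X1", "X2", "X3")]
print(counts)
```

This should give the same counts as summing a column of ones, with one step fewer.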

Create a new table with all the combinations (here I use CJ, the cross join function from the data.table package, because it is known to be fast):

dt_temp = df2[, do.call(CJ, c(.SD, unique=TRUE)), .SDcols=c("X1", "X2", "X3")]
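
To make the cross join step concrete, here is a small standalone sketch of what CJ returns when it is given the distinct values of each column directly. The name grid is only illustrative:

```r
library(data.table)

# one row per combination of the supplied values
grid <- CJ(X1 = c(1, 2), X2 = c(1, 2), X3 = c(3, 4, 5))
print(nrow(grid))  # 2 * 2 * 3 = 12 combinations
print(head(grid))
```

Note that the line above builds the grid from the values actually observed in each column. If the columns are factors and some levels never occur in the data, those levels would have to be passed in explicitly to end up in the grid.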

Merge with the original data to get the weight associated with each combination, and replace the NA weights with 0 (again with data.table):

df3 <- merge(dt_temp, df2,by=colnames(dt_temp), all.x=TRUE)[, T_weigth := ifelse(is.na(T_weigth),0, T_weigth )]
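
The three steps can also be wrapped into one function that follows the setup of the question more closely (factor columns, T_weight starting at 0.001, F_weight equal to 1). This is only a sketch under those assumptions: the function name grid_weights and its offset argument are hypothetical, and the key columns come back as character rather than factor.

```r
library(data.table)

# Hypothetical helper: df has factor columns named in `inputs`
grid_weights <- function(df, inputs, offset = 0.001) {
  dt <- as.data.table(df)
  # 1. count how often each observed combination occurs
  counts <- dt[, .(n = .N), by = inputs]
  # compare on character values so the keys match levels() below
  counts[, (inputs) := lapply(.SD, as.character), .SDcols = inputs]
  # 2. full grid built from the factor levels, observed or not
  lvls <- setNames(lapply(inputs, function(col) levels(df[[col]])), inputs)
  grid <- do.call(CJ, lvls)
  # 3. left join: combinations that never occur get n = NA
  out <- merge(grid, counts, by = inputs, all.x = TRUE)
  out[, T_weight := offset + ifelse(is.na(n), 0, n)]
  out[, F_weight := 1]
  out[, n := NULL]
  out[]
}

# the example from the question
df2 <- data.frame(X1 = factor(c(1, 2, 2, 1)),
                  X2 = factor(c(2, 2, 1, 1)),
                  X3 = factor(c(3, 4, 5, 3)))
res <- grid_weights(df2, c("X1", "X2", "X3"))
print(res)
```

The cost is one grouped count plus one join, instead of a full scan of the data for every row of the grid.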

The same approach (count the combinations, build the grid, merge) should also work with base R alone, and should still be faster than your loop.
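
For the base route, one possible sketch uses a different technique from the count-then-merge steps above: table() cross-tabulates over all factor levels, so converting its result to a data frame yields the grid and the counts in one go. It assumes the selected columns are factors, as in the question; the intermediate column name n is my own choice.

```r
# example data from the question, with factor columns
df2 <- data.frame(X1 = factor(c(1, 2, 2, 1)),
                  X2 = factor(c(2, 2, 1, 1)),
                  X3 = factor(c(3, 4, 5, 3)))
inputs <- c("X1", "X2", "X3")

# table() keeps every combination of levels; unobserved ones get a count of 0
grid_df <- as.data.frame(table(df2[inputs]), responseName = "n")
grid_df$F_weight <- 1
grid_df$T_weight <- 0.001 + grid_df$n
grid_df$n <- NULL
print(grid_df)
```

The table has as many cells as the grid has rows, so memory use grows with the product of the level counts, just as it does with expand.grid.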