J.Tristan - 5 months ago 29

R Question

Given a data frame and a selection of its columns, I am trying to build a data frame that is a grid of all possible combinations of the levels of the selected columns. I also add a variable, T_weight, which counts how many times each combination appears in the original data frame. It is initialized at 0.001 rather than 0 so that combinations absent from the data are not excluded in later processing.

Here is the code I am using and a simple example :

```
Grid_from_df <- function(df,inputs,true_weights=F){
  df_levels <- sapply(df[,inputs],levels)
  print(df_levels)
  grid_df <- expand.grid(df_levels)
  print(grid_df)
  grid_df$F_weight <- rep(1,nrow(grid_df))
  if (true_weights){
    grid_df$T_weight <- rep(0.001,nrow(grid_df))
    for (i in 1:nrow(grid_df)){
      sel <- t(apply(df[,inputs],1,function(row) grid_df[i,inputs] == row))
      sel2 <- apply(sel,1,function(row) sum(row)==length(row))
      grid_df[i,"T_weight"] <- grid_df[i,"T_weight"] + sum(sel2)
    }
  }
  return(grid_df)
}

df2 <- data.frame(X1=c(1,2,2,1),X2=c(2,2,1,1),X3=c(3,4,5,3))
df2$X1 <- as.factor(df2$X1)
df2$X2 <- as.factor(df2$X2)
df2$X3 <- as.factor(df2$X3)

df3 <- Grid_from_df(df2,c("X1","X2","X3"),T)
```

Although this works on the simple example, computing T_weight takes extremely long on data frames the size of those in my project (about 10 variables and 300 000 rows).

Since my R skills are rather limited, I was wondering whether someone has a faster way to calculate T_weight, perhaps without a for loop.

Thank you for your time

Answer

Here is your ideal answer:

Start with a data frame in which one combination, (1, 2, 3), appears twice:

```
df2 <- data.frame(X1=c(1,2,2,1,1),X2=c(2,2,1,1,2),X3=c(3,4,5,3,3))
library(data.table)
df2 <- data.table(df2)
df2$T_weigth <- 1
```

This collapses each group of identical rows and counts the number of times it appears:

```
df2 <- df2[, list(T_weigth=sum(T_weigth)), by=c("X1", "X2", "X3")]
```
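As a side note, the same count can be obtained without the helper column of ones, using data.table's built-in `.N` (the number of rows in the current group). The sketch below is self-contained; the column is named `T_weight`, spelled as in the question, and the object names `dt` and `counts` are only illustrative.

```r
library(data.table)

# Same toy data as above: the combination (1, 2, 3) appears twice.
dt <- data.table(X1 = c(1, 2, 2, 1, 1),
                 X2 = c(2, 2, 1, 1, 2),
                 X3 = c(3, 4, 5, 3, 3))

# .N is the number of rows in each group, so no column of ones is needed.
counts <- dt[, .(T_weight = .N), by = .(X1, X2, X3)]
print(counts)
```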

Create a new data table with all the combinations (here I use `CJ`, the cross join function from the data.table package, because it is known to be fast):

```
dt_temp = df2[, do.call(CJ, c(.SD, unique=TRUE)), .SDcols=c("X1", "X2", "X3")]
```
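To make the cross join concrete, here is a minimal, self-contained sketch of `CJ` on plain vectors. The values are the ones observed in the toy data; `grid` is just an illustrative name. Note that with `unique=TRUE` the grid is built from the values that occur in the data, so a factor level that never appears in any row would presumably be missing from it, unlike the levels-based grid in the question.

```r
library(data.table)

# Every combination of the supplied values: 2 * 2 * 3 = 12 rows.
grid <- CJ(X1 = c(1, 2), X2 = c(1, 2), X3 = c(3, 4, 5))
print(grid)
```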

Merge with the collapsed data to attach each weight to the correct combination, and replace the `NA` weights with 0 (again with data.table):

```
df3 <- merge(dt_temp, df2,by=colnames(dt_temp), all.x=TRUE)[, T_weigth := ifelse(is.na(T_weigth),0, T_weigth )]
```
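The three steps can be wrapped into a function with the same inputs as the one in the question. This is only a sketch under some assumptions: the function name `grid_from_df_fast` and the `offset` argument are made up for illustration, the selected columns are assumed to be factors, the grid is built from the factor levels with `expand.grid` (as in the question) rather than from observed values, and the 0.001 starting value from the question is added to every count.

```r
library(data.table)

# Hypothetical helper; `df` and `inputs` mirror the question's arguments.
grid_from_df_fast <- function(df, inputs, offset = 0.001) {
  dt <- as.data.table(df)
  # Count each observed combination once (.N is the group size).
  counts <- dt[, .(T_weight = as.numeric(.N)), by = inputs]
  # Full grid over the factor levels, so unobserved levels are included.
  lv <- setNames(lapply(inputs, function(v) levels(df[[v]])), inputs)
  grid <- as.data.table(expand.grid(lv))
  # Left join: combinations absent from the data get NA, then 0.
  out <- merge(grid, counts, by = inputs, all.x = TRUE)
  out[is.na(T_weight), T_weight := 0]
  out[, T_weight := T_weight + offset]  # offset assumed from the question
  out[, F_weight := 1]
  out[]
}

# The example data frame from the question, with factor columns.
df2 <- data.frame(X1 = factor(c(1, 2, 2, 1)),
                  X2 = factor(c(2, 2, 1, 1)),
                  X3 = factor(c(3, 4, 5, 3)))
df3 <- grid_from_df_fast(df2, c("X1", "X2", "X3"))
print(df3)
```

The counting is done once per group instead of once per grid row, which is where the speed-up over the loop would be expected to come from.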

The same approach should also work with `base` R, and should still be faster than your solution.
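One possible base R version, as a sketch rather than a benchmarked solution: when the selected columns are factors, `table()` cross-tabulates every combination of their levels, including combinations with a count of zero, and `as.data.frame()` turns the result into a grid with one row per combination. The 0.001 offset is taken from the question; `inputs` and `grid_df` are illustrative names.

```r
# The example data frame from the question, with factor columns.
df2 <- data.frame(X1 = factor(c(1, 2, 2, 1)),
                  X2 = factor(c(2, 2, 1, 1)),
                  X3 = factor(c(3, 4, 5, 3)))
inputs <- c("X1", "X2", "X3")

# table() counts every combination of factor levels, zeros included;
# responseName sets the name of the count column.
grid_df <- as.data.frame(table(df2[inputs]), responseName = "T_weight")

# Add the small starting value used in the question.
grid_df$T_weight <- grid_df$T_weight + 0.001
grid_df$F_weight <- 1
print(grid_df)
```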