newbie - 1 year ago 88
R Question

# Find mean difference by group on variables using data.table

Supposed, I have the following

`data.table`

``````library(data.table)
dt <- data.table(x1 = c(1:12), x2=c(21:32))
``````

Then, I create bins by user-specified intervals using the following:

``````dt[,intx1:=cut(x1, breaks = c(-Inf, 4, 9, Inf))]
``````

return,

``````    x1 x2    intx1
1:  1 21 (-Inf,4]
2:  2 22 (-Inf,4]
3:  3 23 (-Inf,4]
4:  4 24 (-Inf,4]
5:  5 25    (4,9]
6:  6 26    (4,9]
7:  7 27    (4,9]
8:  8 28    (4,9]
9:  9 29    (4,9]
10: 10 30 (9, Inf]
11: 11 31 (9, Inf]
12: 12 32 (9, Inf]
``````

I am trying to find the mean difference between bins and variable:

``````dt[, mux1_grp:=mean(x1), by = intx1][,mux1_pop:=mean(x1)][,mux1_diff:=mux1_grp-mux1_pop]
dt[,`:=`(intx1=NULL, mux1_grp=NULL, mux1_pop=NULL)]
``````

The return is:

``````    x1 x2 mux1_diff
1:  1 21      -4.0
2:  2 22      -4.0
3:  3 23      -4.0
4:  4 24      -4.0
5:  5 25       0.5
6:  6 26       0.5
7:  7 27       0.5
8:  8 28       0.5
9:  9 29       0.5
10: 10 30       4.5
11: 11 31       4.5
12: 12 32       4.5
``````

However, my original data contain several variables (e.g., x1, x2,...,x20).

So, I have to repeat the same procedure for x2 as follows:

``````dt[,intx2:=cut(x2, breaks = c(-Inf, 25, 28, Inf))]
dt[, mux2_grp:=mean(x2), by = intx2][,mux2_pop:=mean(x2)][,mux2_diff:=mux2_grp-mux2_pop]
dt[,`:=`(intx2=NULL, mux2_grp=NULL, mux2_pop=NULL)]
``````

My final output will be:

``````    x1 x2 mux1_diff mux2_diff
1:  1 21      -4.0      -3.5
2:  2 22      -4.0      -3.5
3:  3 23      -4.0      -3.5
4:  4 24      -4.0      -3.5
5:  5 25       0.5      -3.5
6:  6 26       0.5       0.5
7:  7 27       0.5       0.5
8:  8 28       0.5       0.5
9:  9 29       0.5       4.0
10: 10 30       4.5       4.0
11: 11 31       4.5       4.0
12: 12 32       4.5       4.0
``````

How can I improve this code? Note that each variable has different user-specified intervals

We can do this in a compact one-step option (though it may be not that optimal when compared to the OP's method (from @Frank's comments)

``````dt[, mu_diff := mean(x) - mean(dt\$x), by = .(cut(x, breaks = c(-Inf, 4, 9, Inf)))][]
#    x    mu_diff
#1:  1 -3.8636364
#2:  2 -3.8636364
#3:  3 -3.8636364
#4:  4 -3.8636364
#5:  5  0.3863636
#6:  6  0.3863636
#7:  7  0.3863636
#8:  9  0.3863636
#9: 10  4.6363636
#10:11  4.6363636
#11:12  4.6363636
``````

If there are many variables (it is not clear whether we use the same `breaks` in `cut` or not for different columns - assuming that it is the same), we can loop through the columns (in the reproducible example below, two variables 'x1' and 'x2' are showed), specify the `.SDcols` by the numeric index of the column, grouped by the `cut` of the subsetted column, we assign the new column as the difference between the `mean` of the values within the group and the `mean` of the entire column.

``````nm1 <- paste0("mu_diff", seq_along(dt1))
for(j in seq_along(dt1)){
dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]),
by = .(cut(get(names(dt1)[j]), breaks = c(-Inf, 4, 9, Inf))) ,
.SDcols = j][]
}
``````

Update - Suppose if the `breaks` for `cut` variable is different for each column, place it in a `list` and get that `list` element inside the `for` loop with the indexing.

``````brkLst <- list(c(-Inf, 4, 9, Inf), c(-Inf, 10, 14, Inf))
for(j in seq_along(dt1)){
dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]),
by = .(cut(get(names(dt1)[j]), breaks = brkLst[[j]])) ,
.SDcols = j][]
}
``````

Checking the output with OP's new data ('dt2')

``````brkLst2 <- list(c(-Inf, 4, 9, Inf),  c(-Inf, 25, 28, Inf))
nm1 <- paste0("mu", names(dt2), "_diff")
for(j in seq_along(dt2)){
dt2[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt2[[names(dt2)[j]]]),
by = .(cut(get(names(dt2)[j]), breaks = brkLst2[[j]])) ,
.SDcols = j][]
}

dt2
#    x1 x2 mux1_diff mux2_diff
# 1:  1 21      -4.0      -3.5
# 2:  2 22      -4.0      -3.5
# 3:  3 23      -4.0      -3.5
# 4:  4 24      -4.0      -3.5
# 5:  5 25       0.5      -3.5
# 6:  6 26       0.5       0.5
# 7:  7 27       0.5       0.5
# 8:  8 28       0.5       0.5
# 9:  9 29       0.5       4.0
#10: 10 30       4.5       4.0
#11: 11 31       4.5       4.0
#12: 12 32       4.5       4.0
``````

### data

``````dt1 <- data.table(x1 = c(1,2,3,4,5,6,7,9,10,11,12))[, x2 := x1 + 5][]
#OP's changed dataset
dt2 <- data.table(x1 = 1:12, x2=21:32)
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download