newbie - 6 months ago 40

R Question

Supposed, I have the following

`data.table`

`library(data.table)`

dt <- data.table(x1 = c(1:12), x2=c(21:32))

Then, I create bins by user-specified intervals using the following:

`dt[,intx1:=cut(x1, breaks = c(-Inf, 4, 9, Inf))]`

return,

`x1 x2 intx1`

1: 1 21 (-Inf,4]

2: 2 22 (-Inf,4]

3: 3 23 (-Inf,4]

4: 4 24 (-Inf,4]

5: 5 25 (4,9]

6: 6 26 (4,9]

7: 7 27 (4,9]

8: 8 28 (4,9]

9: 9 29 (4,9]

10: 10 30 (9, Inf]

11: 11 31 (9, Inf]

12: 12 32 (9, Inf]

I am trying to find the mean difference between bins and variable:

`dt[, mux1_grp:=mean(x1), by = intx1][,mux1_pop:=mean(x1)][,mux1_diff:=mux1_grp-mux1_pop]`

dt[,`:=`(intx1=NULL, mux1_grp=NULL, mux1_pop=NULL)]

The return is:

`x1 x2 mux1_diff`

1: 1 21 -4.0

2: 2 22 -4.0

3: 3 23 -4.0

4: 4 24 -4.0

5: 5 25 0.5

6: 6 26 0.5

7: 7 27 0.5

8: 8 28 0.5

9: 9 29 0.5

10: 10 30 4.5

11: 11 31 4.5

12: 12 32 4.5

However, my original data contain several variables (e.g., x1, x2,...,x20).

So, I have to repeat the same procedure for x2 as follows:

`dt[,intx2:=cut(x2, breaks = c(-Inf, 25, 28, Inf))]`

dt[, mux2_grp:=mean(x2), by = intx2][,mux2_pop:=mean(x2)][,mux2_diff:=mux2_grp-mux2_pop]

dt[,`:=`(intx2=NULL, mux2_grp=NULL, mux2_pop=NULL)]

My final output will be:

`x1 x2 mux1_diff mux2_diff`

1: 1 21 -4.0 -3.5

2: 2 22 -4.0 -3.5

3: 3 23 -4.0 -3.5

4: 4 24 -4.0 -3.5

5: 5 25 0.5 -3.5

6: 6 26 0.5 0.5

7: 7 27 0.5 0.5

8: 8 28 0.5 0.5

9: 9 29 0.5 4.0

10: 10 30 4.5 4.0

11: 11 31 4.5 4.0

12: 12 32 4.5 4.0

How can I improve this code? Note that each variable has different user-specified intervals

Answer

We can do this in a compact one-step option (though it may be not that optimal when compared to the OP's method (from @Frank's comments)

```
dt[, mu_diff := mean(x) - mean(dt$x), by = .(cut(x, breaks = c(-Inf, 4, 9, Inf)))][]
# x mu_diff
#1: 1 -3.8636364
#2: 2 -3.8636364
#3: 3 -3.8636364
#4: 4 -3.8636364
#5: 5 0.3863636
#6: 6 0.3863636
#7: 7 0.3863636
#8: 9 0.3863636
#9: 10 4.6363636
#10:11 4.6363636
#11:12 4.6363636
```

If there are many variables (it is not clear whether we use the same `breaks`

in `cut`

or not for different columns - assuming that it is the same), we can loop through the columns (in the reproducible example below, two variables 'x1' and 'x2' are showed), specify the `.SDcols`

by the numeric index of the column, grouped by the `cut`

of the subsetted column, we assign the new column as the difference between the `mean`

of the values within the group and the `mean`

of the entire column.

```
nm1 <- paste0("mu_diff", seq_along(dt1))
for(j in seq_along(dt1)){
dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]),
by = .(cut(get(names(dt1)[j]), breaks = c(-Inf, 4, 9, Inf))) ,
.SDcols = j][]
}
```

**Update** - Suppose if the `breaks`

for `cut`

variable is *different* for each column, place it in a `list`

and get that `list`

element inside the `for`

loop with the indexing.

```
brkLst <- list(c(-Inf, 4, 9, Inf), c(-Inf, 10, 14, Inf))
for(j in seq_along(dt1)){
dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]),
by = .(cut(get(names(dt1)[j]), breaks = brkLst[[j]])) ,
.SDcols = j][]
}
```

Checking the output with OP's new data ('dt2')

```
brkLst2 <- list(c(-Inf, 4, 9, Inf), c(-Inf, 25, 28, Inf))
nm1 <- paste0("mu", names(dt2), "_diff")
for(j in seq_along(dt2)){
dt2[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt2[[names(dt2)[j]]]),
by = .(cut(get(names(dt2)[j]), breaks = brkLst2[[j]])) ,
.SDcols = j][]
}
dt2
# x1 x2 mux1_diff mux2_diff
# 1: 1 21 -4.0 -3.5
# 2: 2 22 -4.0 -3.5
# 3: 3 23 -4.0 -3.5
# 4: 4 24 -4.0 -3.5
# 5: 5 25 0.5 -3.5
# 6: 6 26 0.5 0.5
# 7: 7 27 0.5 0.5
# 8: 8 28 0.5 0.5
# 9: 9 29 0.5 4.0
#10: 10 30 4.5 4.0
#11: 11 31 4.5 4.0
#12: 12 32 4.5 4.0
```

```
dt1 <- data.table(x1 = c(1,2,3,4,5,6,7,9,10,11,12))[, x2 := x1 + 5][]
#OP's changed dataset
dt2 <- data.table(x1 = 1:12, x2=21:32)
```