statquant - 3 months ago 12
R Question

# Can you explain this by-group data.table result

I must be doing something obviously stupid here but can someone explain why it appears like

`data.table`
is not doing the following by group operation

``````set.seed(1)
DT = data.table(grp=c(rep('a',100),rep('b',100)), val=c(runif(100), rnorm(100)))
DT[grp=='a',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

10%    20%    30%    40%    50%    60%    70%    80%    90%
-Inf 0.1415 0.2555 0.3448 0.4108 0.4878 0.6442 0.7140 0.7842 0.8703    Inf

DT[grp=='b',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

10%      20%      30%      40%      50%      60%      70%      80%      90%
-Inf -1.22751 -0.66000 -0.55036 -0.32170 -0.11762  0.06583  0.37427  0.69183  1.35196      Inf

DT[,interval:=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)),.(grp)][]

grp     val        interval
1:   a  0.2655   (-0.66,-0.55] => this is a "b" interval ? I would expect (0.2555 0.3448]
2:   a  0.3721  (-0.55,-0.322]
3:   a  0.5729 (-0.118,0.0658]
4:   a  0.9082     (1.35, Inf]
5:   a  0.2017   (-1.23,-0.66]
---
196:   b -0.7508   (-1.23,-0.66]
197:   b  2.0872     (1.35, Inf]
198:   b  0.0174 (-0.118,0.0658]
199:   b -1.2863    (-Inf,-1.23]
200:   b -1.6406    (-Inf,-1.23]
``````

I typically want to do things like :

``````DT[,mean(val),keyby=.(grp,interval=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))]
grp        interval          V1
1:   a (-0.321,0.0379]  0.01836077  => this is not a "a" interval
2:   a   (0.0379,0.21]  0.13190935
3:   a    (0.21,0.358]  0.29068707
4:   a   (0.358,0.477]  0.41647597
5:   a   (0.477,0.648]  0.55190648
6:   a   (0.648,0.777]  0.70883795
7:   a   (0.777,0.915]  0.84091210
8:   a    (0.915, Inf]  0.95797615
9:   b   (-Inf,-0.657] -1.23322909
10:   b (-0.657,-0.321] -0.53243898
11:   b (-0.321,0.0379] -0.13968720
12:   b   (0.0379,0.21]  0.11278201
13:   b    (0.21,0.358]  0.30783459
14:   b   (0.358,0.477]  0.40695489
15:   b   (0.477,0.648]  0.55976052
16:   b   (0.648,0.777]  0.70483170
17:   b   (0.777,0.915]  0.91017423
18:   b    (0.915, Inf]  1.57112705
``````

And this looks suspiciously like if the intervals were defined on the WHOLE dataset instead of the groups

``````DT[,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]
10%         20%         30%         40%         50%         60%         70%         80%         90%
-Inf -0.65729223 -0.32084835  0.03788176  0.20967534  0.35835115  0.47738589  0.64820328  0.77734560  0.91505885         Inf
``````

It looks like you are expecting a fancy way of combining factor levels (which is what `cut` creates) across groups. Instead, you found strange behavior, which is typical for factors.

I guess you could use strings:

``````DT[,interval :=
as.character(cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))
, by=grp]
``````

which gives

``````     grp         val        interval
1:   a  0.26550866   (0.256,0.345]
2:   a  0.37212390   (0.345,0.411]
3:   a  0.57285336   (0.488,0.644]
4:   a  0.90820779     (0.87, Inf]
5:   a  0.20168193   (0.142,0.256]
---
196:   b -0.75081900   (-1.23,-0.66]
197:   b  2.08716655     (1.35, Inf]
198:   b  0.01739562 (-0.118,0.0658]
199:   b -1.28630053    (-Inf,-1.23]
200:   b -1.64060553    (-Inf,-1.23]
``````

These intervals are not good for much of anything, however. If you try to sort by them, like `DT[, mean(val), keyby=.(grp, interval)]`, you'll see that they are out of order.

If you just want these cuts around for a single calculation...

``````mycut = function(x) cut(x,c(-Inf,quantile(x,probs=seq(.1,.9,.1)),Inf))

DT[,{
.SD[, mean(val), keyby=.(interval=mycut(val))][, interval := as.character(interval)]
},keyby=grp]
``````

which gives

``````    grp        interval          V1
1:   a    (-Inf,0.142]  0.07670249
2:   a   (0.142,0.256]  0.20584852
3:   a   (0.256,0.345]  0.30715649
4:   a   (0.345,0.411]  0.38583465
5:   a   (0.411,0.488]  0.45901975
6:   a   (0.488,0.644]  0.56413855
7:   a   (0.644,0.714]  0.67442643
8:   a   (0.714,0.784]  0.75834958
9:   a    (0.784,0.87]  0.82747749
10:   a     (0.87, Inf]  0.91951669
11:   b    (-Inf,-1.23] -1.54198329
12:   b   (-1.23,-0.66] -0.92447488
13:   b   (-0.66,-0.55] -0.61458549
14:   b  (-0.55,-0.322] -0.45029247
15:   b (-0.322,-0.118] -0.22533466
16:   b (-0.118,0.0658] -0.01587467
17:   b  (0.0658,0.374]  0.24836075
18:   b   (0.374,0.692]  0.53061032
19:   b    (0.692,1.35]  1.01688411
20:   b     (1.35, Inf]  1.80089535
``````

Yeah, not very elegant, but I think this is a problem from R itself, and it's not obvious how it should change to address your case.