statquant statquant - 3 months ago 12
R Question

Can you explain this by-group data.table result

I must be doing something obviously stupid here but can someone explain why it appears like

data.table
is not doing the following by group operation

set.seed(1)
DT = data.table(grp=c(rep('a',100),rep('b',100)), val=c(runif(100), rnorm(100)))
DT[grp=='a',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

10% 20% 30% 40% 50% 60% 70% 80% 90%
-Inf 0.1415 0.2555 0.3448 0.4108 0.4878 0.6442 0.7140 0.7842 0.8703 Inf

DT[grp=='b',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

10% 20% 30% 40% 50% 60% 70% 80% 90%
-Inf -1.22751 -0.66000 -0.55036 -0.32170 -0.11762 0.06583 0.37427 0.69183 1.35196 Inf

DT[,interval:=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)),.(grp)][]

grp val interval
1: a 0.2655 (-0.66,-0.55] => this is a "b" interval ? I would expect (0.2555 0.3448]
2: a 0.3721 (-0.55,-0.322]
3: a 0.5729 (-0.118,0.0658]
4: a 0.9082 (1.35, Inf]
5: a 0.2017 (-1.23,-0.66]
---
196: b -0.7508 (-1.23,-0.66]
197: b 2.0872 (1.35, Inf]
198: b 0.0174 (-0.118,0.0658]
199: b -1.2863 (-Inf,-1.23]
200: b -1.6406 (-Inf,-1.23]


I typically want to do things like :

DT[,mean(val),keyby=.(grp,interval=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))]
grp interval V1
1: a (-0.321,0.0379] 0.01836077 => this is not a "a" interval
2: a (0.0379,0.21] 0.13190935
3: a (0.21,0.358] 0.29068707
4: a (0.358,0.477] 0.41647597
5: a (0.477,0.648] 0.55190648
6: a (0.648,0.777] 0.70883795
7: a (0.777,0.915] 0.84091210
8: a (0.915, Inf] 0.95797615
9: b (-Inf,-0.657] -1.23322909
10: b (-0.657,-0.321] -0.53243898
11: b (-0.321,0.0379] -0.13968720
12: b (0.0379,0.21] 0.11278201
13: b (0.21,0.358] 0.30783459
14: b (0.358,0.477] 0.40695489
15: b (0.477,0.648] 0.55976052
16: b (0.648,0.777] 0.70483170
17: b (0.777,0.915] 0.91017423
18: b (0.915, Inf] 1.57112705


And this looks suspiciously like if the intervals were defined on the WHOLE dataset instead of the groups

DT[,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]
10% 20% 30% 40% 50% 60% 70% 80% 90%
-Inf -0.65729223 -0.32084835 0.03788176 0.20967534 0.35835115 0.47738589 0.64820328 0.77734560 0.91505885 Inf

Answer

It looks like you are expecting a fancy way of combining factor levels (which is what cut creates) across groups. Instead, you found strange behavior, which is typical for factors.

I guess you could use strings:

DT[,interval := 
  as.character(cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))
, by=grp]

which gives

     grp         val        interval
  1:   a  0.26550866   (0.256,0.345]
  2:   a  0.37212390   (0.345,0.411]
  3:   a  0.57285336   (0.488,0.644]
  4:   a  0.90820779     (0.87, Inf]
  5:   a  0.20168193   (0.142,0.256]
 ---                                
196:   b -0.75081900   (-1.23,-0.66]
197:   b  2.08716655     (1.35, Inf]
198:   b  0.01739562 (-0.118,0.0658]
199:   b -1.28630053    (-Inf,-1.23]
200:   b -1.64060553    (-Inf,-1.23]

These intervals are not good for much of anything, however. If you try to sort by them, like DT[, mean(val), keyby=.(grp, interval)], you'll see that they are out of order.


If you just want these cuts around for a single calculation...

mycut = function(x) cut(x,c(-Inf,quantile(x,probs=seq(.1,.9,.1)),Inf))

DT[,{
    .SD[, mean(val), keyby=.(interval=mycut(val))][, interval := as.character(interval)]
},keyby=grp]

which gives

    grp        interval          V1
 1:   a    (-Inf,0.142]  0.07670249
 2:   a   (0.142,0.256]  0.20584852
 3:   a   (0.256,0.345]  0.30715649
 4:   a   (0.345,0.411]  0.38583465
 5:   a   (0.411,0.488]  0.45901975
 6:   a   (0.488,0.644]  0.56413855
 7:   a   (0.644,0.714]  0.67442643
 8:   a   (0.714,0.784]  0.75834958
 9:   a    (0.784,0.87]  0.82747749
10:   a     (0.87, Inf]  0.91951669
11:   b    (-Inf,-1.23] -1.54198329
12:   b   (-1.23,-0.66] -0.92447488
13:   b   (-0.66,-0.55] -0.61458549
14:   b  (-0.55,-0.322] -0.45029247
15:   b (-0.322,-0.118] -0.22533466
16:   b (-0.118,0.0658] -0.01587467
17:   b  (0.0658,0.374]  0.24836075
18:   b   (0.374,0.692]  0.53061032
19:   b    (0.692,1.35]  1.01688411
20:   b     (1.35, Inf]  1.80089535

Yeah, not very elegant, but I think this is a problem from R itself, and it's not obvious how it should change to address your case.

Comments