statquant - 8 months ago 45

R Question

I must be doing something obviously stupid here but can someone explain why it appears like

`data.table`

`set.seed(1)`

DT = data.table(grp=c(rep('a',100),rep('b',100)), val=c(runif(100), rnorm(100)))

DT[grp=='a',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

10% 20% 30% 40% 50% 60% 70% 80% 90%

-Inf 0.1415 0.2555 0.3448 0.4108 0.4878 0.6442 0.7140 0.7842 0.8703 Inf

DT[grp=='b',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

10% 20% 30% 40% 50% 60% 70% 80% 90%

-Inf -1.22751 -0.66000 -0.55036 -0.32170 -0.11762 0.06583 0.37427 0.69183 1.35196 Inf

DT[,interval:=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)),.(grp)][]

grp val interval

1: a 0.2655 (-0.66,-0.55] => this is a "b" interval ? I would expect (0.2555 0.3448]

2: a 0.3721 (-0.55,-0.322]

3: a 0.5729 (-0.118,0.0658]

4: a 0.9082 (1.35, Inf]

5: a 0.2017 (-1.23,-0.66]

---

196: b -0.7508 (-1.23,-0.66]

197: b 2.0872 (1.35, Inf]

198: b 0.0174 (-0.118,0.0658]

199: b -1.2863 (-Inf,-1.23]

200: b -1.6406 (-Inf,-1.23]

I typically want to do things like :

`DT[,mean(val),keyby=.(grp,interval=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))]`

grp interval V1

1: a (-0.321,0.0379] 0.01836077 => this is not a "a" interval

2: a (0.0379,0.21] 0.13190935

3: a (0.21,0.358] 0.29068707

4: a (0.358,0.477] 0.41647597

5: a (0.477,0.648] 0.55190648

6: a (0.648,0.777] 0.70883795

7: a (0.777,0.915] 0.84091210

8: a (0.915, Inf] 0.95797615

9: b (-Inf,-0.657] -1.23322909

10: b (-0.657,-0.321] -0.53243898

11: b (-0.321,0.0379] -0.13968720

12: b (0.0379,0.21] 0.11278201

13: b (0.21,0.358] 0.30783459

14: b (0.358,0.477] 0.40695489

15: b (0.477,0.648] 0.55976052

16: b (0.648,0.777] 0.70483170

17: b (0.777,0.915] 0.91017423

18: b (0.915, Inf] 1.57112705

And this looks suspiciously like if the intervals were defined on the WHOLE dataset instead of the groups

`DT[,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]`

10% 20% 30% 40% 50% 60% 70% 80% 90%

-Inf -0.65729223 -0.32084835 0.03788176 0.20967534 0.35835115 0.47738589 0.64820328 0.77734560 0.91505885 Inf

Answer

It looks like you are expecting a fancy way of combining factor levels (which is what `cut`

creates) across groups. Instead, you found strange behavior, which is typical for factors.

I guess you could use strings:

```
DT[,interval :=
as.character(cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))
, by=grp]
```

which gives

```
grp val interval
1: a 0.26550866 (0.256,0.345]
2: a 0.37212390 (0.345,0.411]
3: a 0.57285336 (0.488,0.644]
4: a 0.90820779 (0.87, Inf]
5: a 0.20168193 (0.142,0.256]
---
196: b -0.75081900 (-1.23,-0.66]
197: b 2.08716655 (1.35, Inf]
198: b 0.01739562 (-0.118,0.0658]
199: b -1.28630053 (-Inf,-1.23]
200: b -1.64060553 (-Inf,-1.23]
```

These intervals are not good for much of anything, however. If you try to sort by them, like `DT[, mean(val), keyby=.(grp, interval)]`

, you'll see that they are out of order.

If you just want these cuts around for a single calculation...

```
mycut = function(x) cut(x,c(-Inf,quantile(x,probs=seq(.1,.9,.1)),Inf))
DT[,{
.SD[, mean(val), keyby=.(interval=mycut(val))][, interval := as.character(interval)]
},keyby=grp]
```

which gives

```
grp interval V1
1: a (-Inf,0.142] 0.07670249
2: a (0.142,0.256] 0.20584852
3: a (0.256,0.345] 0.30715649
4: a (0.345,0.411] 0.38583465
5: a (0.411,0.488] 0.45901975
6: a (0.488,0.644] 0.56413855
7: a (0.644,0.714] 0.67442643
8: a (0.714,0.784] 0.75834958
9: a (0.784,0.87] 0.82747749
10: a (0.87, Inf] 0.91951669
11: b (-Inf,-1.23] -1.54198329
12: b (-1.23,-0.66] -0.92447488
13: b (-0.66,-0.55] -0.61458549
14: b (-0.55,-0.322] -0.45029247
15: b (-0.322,-0.118] -0.22533466
16: b (-0.118,0.0658] -0.01587467
17: b (0.0658,0.374] 0.24836075
18: b (0.374,0.692] 0.53061032
19: b (0.692,1.35] 1.01688411
20: b (1.35, Inf] 1.80089535
```

Yeah, not very elegant, but I think this is a problem from R itself, and it's not obvious how it should change to address your case.