goutam - 4 months ago 23

R Question

I am trying to bin a variable with values between 1 and 100,000 into ten groups of width 10,000. I am using the following code and getting an error.

`cut(x, breaks = quantile(x, probs=seq(0, 100000, 10000)), include.lowest = TRUE)`

What am I doing wrong?

Answer

Well, at first I saw this as a typo, but after some discussion in comments I decided to write an answer.

The error comes from `quantile`, as `probs` should be between 0 and 1 (read `?quantile`).
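You can reproduce the failure in isolation. The sketch below uses a made-up `x` (an assumption, since the question does not show the data) and catches the error so that its message can be read:

```r
# Hypothetical stand-in for the asker's data: integers 1 to 100,000
x <- 1:100000

# probs must lie in [0, 1]; values such as 10000 make quantile() stop
msg <- tryCatch(
  quantile(x, probs = seq(0, 100000, 10000)),
  error = function(e) conditionMessage(e)
)
print(msg)

# Valid call: eleven probabilities give the boundaries of ten bins
q <- quantile(x, probs = seq(0, 1, 0.1), names = FALSE)
print(q)
```

So the numbers passed to `probs` are probabilities, not values on the scale of `x`.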

It looks like you have mixed up the following two approaches:

```
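# fixed-width bins: break points every 10,000 on the scale of x
cut(x, breaks = seq(0, 100000, 10000), include.lowest = TRUE)
# equal-count bins: break points at the deciles of x
cut(x, breaks = quantile(x, probs = seq(0, 1, 0.1)), include.lowest = TRUE)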
```

As I said, they will give different results, especially when your data are not uniformly distributed.
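To see the two calls side by side on the scale from the question, here is a minimal sketch. The data `x` is invented for illustration (500 distinct integers drawn from 1 to 100,000), and `dig.lab = 6` is only there to keep the bin labels out of scientific notation:

```r
set.seed(1)
# Hypothetical data: 500 distinct values between 1 and 100,000
x <- sample(1:100000, 500)

# Fixed-width bins of width 10,000
fixed <- cut(x, breaks = seq(0, 100000, 10000),
             include.lowest = TRUE, dig.lab = 6)

# Equal-count bins with break points at the deciles of x
equal <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.1)),
             include.lowest = TRUE, dig.lab = 6)

# Both have ten levels, but the counts per bin behave differently
print(table(fixed))
print(table(equal))
```

With fixed-width bins the counts depend on where the data happen to fall; with decile breaks each bin holds a tenth of the observations.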

As a representative example, consider non-uniformly distributed data, say Beta distributed:

```
set.seed(0)
x <- rbeta(10000, 3, 5)
b1 <- seq(0, 1, 0.1)
b2 <- quantile(x, probs = seq(0, 1, 0.1), names = FALSE)
round(b2, 2)
# [1] 0.01 0.17 0.23 0.28 0.32 0.37 0.41 0.46 0.52 0.60 0.94
```

Note that the difference between `b2` and `b1` is significant. You can inspect the (empirical) quantile-quantile plot:

```
plot(b1, b2); abline(0, 1)
```

You will see that the dots deviate strongly from the line.
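If you prefer numbers to a plot, the same comparison can be tabulated. This is just a sketch that rebuilds `b1` and `b2` so it runs on its own; the column names in the data frame are my own choice:

```r
set.seed(0)
x <- rbeta(10000, 3, 5)

p  <- seq(0, 1, 0.1)
b1 <- p                                     # evenly spaced break points
b2 <- quantile(x, probs = p, names = FALSE) # empirical deciles of x

# gap = how far each decile sits from the evenly spaced break point
cmp <- data.frame(prob = p, b1 = b1, b2 = round(b2, 2),
                  gap = round(b2 - b1, 2))
print(cmp)
```

A gap of zero in every row would put all dots on the line; here the gaps are clearly non-zero for most rows.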

Above, `b1` gives bins of uniform width, while `b2` gives bins of ragged width. Now consider the bin counts:

```
table(cut(x, breaks = b1, include.lowest = TRUE))
#   [0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8]
#       256      1239      2011      2242      1948      1323       685       245
# (0.8,0.9]   (0.9,1]
#        48         3
table(cut(x, breaks = b2, include.lowest = TRUE))
# [0.0101,0.169]  (0.169,0.228]  (0.228,0.276]  (0.276,0.321]  (0.321,0.365]
#           1000           1000           1000           1000           1000
#  (0.365,0.412]  (0.412,0.463]  (0.463,0.519]  (0.519,0.598]  (0.598,0.935]
#           1000           1000           1000           1000           1000
```

Do you see the difference? If we place break points at quantiles, we get uniform counts over bins; if we place them at equal spacing, the counts follow the shape of the distribution instead.
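If you need quantile binning often, it can be wrapped in a small function. The helper below is hypothetical (the name `bin_equal_count` and its arguments are mine, not part of base R), and it makes one extra assumption: when the data contain many ties, some quantiles coincide and `cut` refuses duplicated breaks, so the sketch drops duplicates with `unique()` at the cost of returning fewer bins.

```r
# Hypothetical helper: bin x into (at most) n groups of roughly equal count
bin_equal_count <- function(x, n = 10) {
  probs  <- seq(0, 1, length.out = n + 1)
  breaks <- quantile(x, probs = probs, names = FALSE)
  # Tied data can produce repeated quantiles; cut() needs unique breaks
  breaks <- unique(breaks)
  cut(x, breaks = breaks, include.lowest = TRUE)
}

set.seed(0)
x <- rbeta(10000, 3, 5)
counts <- table(bin_equal_count(x, 10))
print(counts)

# Heavily tied data: half the values are 0, so several deciles coincide
tied <- bin_equal_count(c(rep(0, 50), 1:50), 10)
print(nlevels(tied))
```

Merging coinciding breaks is only one way to handle ties; whether fewer bins is acceptable depends on what you do with the groups afterwards.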