goutam - 1 year ago 75
R Question

# Error while Binning in R

I am trying to bin a variable with values between 1 and 100,000 into ten groups of 10,000 each. I am using the following code and getting an error. Please help me find where I am going wrong.

``````
Data$Amount_fac <- cut(Data$Amount,breaks=quantile(Data$Amount, probs=seq(from=0, to=100000, by=10000)),include.lowest=TRUE)
``````

The error is
`unexpected = 0`

Well, at first I saw this as a typo, but after some discussion in the comments I decided to write an answer.

The error comes from `quantile`: `probs` must be probabilities between 0 and 1 (read `?quantile`), whereas `seq(from=0, to=100000, by=10000)` produces values far outside that range.
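To see this constraint in isolation, here is a minimal sketch on made-up data. The vector `amount` is hypothetical and only stands in for `Data$Amount`; the faulty call is wrapped in `tryCatch` so the error message can be inspected without stopping the script.

```r
set.seed(1)
# Hypothetical stand-in for Data$Amount: 1000 values between 1 and 100,000
amount <- runif(1000, min = 1, max = 100000)

# probs given on the data scale (0 to 100000) instead of the probability scale
bad <- tryCatch(
  quantile(amount, probs = seq(from = 0, to = 100000, by = 10000)),
  error = function(e) conditionMessage(e)
)
print(bad)  # the error message raised by quantile()

# probs given on the probability scale: 0, 0.1, ..., 1
good <- quantile(amount, probs = seq(0, 1, by = 0.1))
length(good)  # 11 break points, i.e. 10 bins
```

The second call returns the deciles of `amount`, which can then be passed to `cut` as `breaks`.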

It looks like you have mixed up the following two approaches:

``````
# Equal-width bins: fixed breaks every 10,000
cut(Data$Amount, breaks = seq(0, 100000, 10000), include.lowest = TRUE)

# Equal-count bins: breaks at the deciles of the data
cut(Data$Amount, breaks = quantile(Data$Amount, probs = seq(0, 1, 0.1)),
    include.lowest = TRUE)
``````

As I said, they give different results. To see why, it is enough to compare the `breaks`.
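For the range in the question, the two calls above could be applied roughly as follows. This is only a sketch: the data frame `Data` is simulated here, the column names `Amount_width` and `Amount_decile` are made up for illustration, and `dig.lab = 6` is an optional extra that keeps the interval labels out of scientific notation.

```r
set.seed(42)
# Hypothetical data frame standing in for the asker's `Data`
Data <- data.frame(Amount = sample(1:100000, size = 500, replace = TRUE))

# Option 1: ten equal-width groups of 10,000 each
Data$Amount_width <- cut(Data$Amount,
                         breaks = seq(0, 100000, by = 10000),
                         include.lowest = TRUE,
                         dig.lab = 6)

# Option 2: ten groups with roughly equal numbers of rows (deciles)
Data$Amount_decile <- cut(Data$Amount,
                          breaks = quantile(Data$Amount, probs = seq(0, 1, by = 0.1)),
                          include.lowest = TRUE,
                          dig.lab = 6)

nlevels(Data$Amount_width)   # 10
nlevels(Data$Amount_decile)  # 10

# Compare how many rows fall into each group under the two schemes
table(Data$Amount_width)
table(Data$Amount_decile)
```

Note that the quantile version fails with a "breaks are not unique" error if the data contain so many ties that two deciles coincide, so Option 1 is the one that matches "ten groups of 10,000".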

As a representative example, consider non-uniformly distributed data, say Beta distributed:

``````
set.seed(0)
x <- rbeta(10000, 3, 5)

# equal-width breaks on [0, 1]
b1 <- seq(0, 1, 0.1)

# breaks at the empirical deciles of x
b2 <- quantile(x, probs = seq(0, 1, 0.1), names = FALSE)
round(b2, 2)
# [1] 0.01 0.17 0.23 0.28 0.32 0.37 0.41 0.46 0.52 0.60 0.94
``````

Note that the difference between `b2` and `b1` is substantial. You can inspect the (empirical) quantile-quantile plot:

``````
plot(b1, b2); abline(0, 1)
``````

You will see that the dots deviate strongly from the line. So `b1` gives bins of uniform width, while `b2` gives bins of unequal width.
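One way to put numbers on what the plot shows is to compare the bin widths directly with `diff`. The sketch below rebuilds the example's `x`, `b1` and `b2` so it runs on its own; the names `w1` and `w2` are introduced here purely for illustration.

```r
set.seed(0)
x <- rbeta(10000, 3, 5)

b1 <- seq(0, 1, 0.1)                                      # equal-width breaks
b2 <- quantile(x, probs = seq(0, 1, 0.1), names = FALSE)  # decile breaks

w1 <- diff(b1)  # widths of the equal-width bins
w2 <- diff(b2)  # widths of the quantile bins

range(w1)  # minimum and maximum coincide: every bin is 0.1 wide
range(w2)  # minimum and maximum differ: bins are narrow where data are dense
```

The quantile bins are narrowest around the mode of the distribution and widest in the tails, which is what lets them hold equal numbers of observations.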

Now consider bin counts:

``````
table(cut(x, breaks = b1, include.lowest = TRUE))
#  [0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8]
#      256      1239      2011      2242      1948      1323       685       245
#(0.8,0.9]   (0.9,1]
#       48         3

table(cut(x, breaks = b2, include.lowest = TRUE))
#[0.0101,0.169]  (0.169,0.228]  (0.228,0.276]  (0.276,0.321]  (0.321,0.365]
#          1000           1000           1000           1000           1000
# (0.365,0.412]  (0.412,0.463]  (0.463,0.519]  (0.519,0.598]  (0.598,0.935]
#          1000           1000           1000           1000           1000
``````

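Do you see the difference? The equal-width breaks `b1` give very different counts per bin, while the quantile breaks `b2` put exactly 1000 observations in each bin.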
