Michael Ohlrogge - 1 year ago 70
R Question

# splitting a continuous variable into groups of equal number of elements - return numeric vector from bin values

I have a continuous variable that I want to split into bins, returning a numeric vector (of length equal to my original vector) whose values relate to the values of the bins. Each bin should have roughly the same number of elements.

This question, "splitting a continuous variable into equal sized groups", describes a number of techniques for related situations. For instance, if I start with

```r
x = c(1,5,3,12,5,6,7)
```

I can use `cut()` to get:

```r
cut(x, 3, labels = FALSE)
[1] 1 2 1 3 2 2 2
```

This is undesirable because the returned values are just sequential integer codes; they have no direct relation to the underlying original values in my vector.

Another possibility is `cut2`, for instance:

```r
library(Hmisc)
cut2(x, g = 3, levels.mean = TRUE)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5
```

This is better because now the return values relate to the values of the bins. It is still less than ideal, though, since:

• (a) it yields a factor, which then needs to be converted to numeric; this is both slow and awkward code-wise.

• (b) Ideally I'd like to be able to choose whether to use the top or bottom end points of the intervals, instead of just the means.

I know that there are also options using regex on the factors returned by `cut` or `cut2` to get the top or bottom points of the intervals. These too seem overly cumbersome.
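
For reference, here is a minimal sketch of what that label-parsing approach might look like in base R. The breaks are hand-picked purely for illustration (they are not what `cut2` would choose in general), and the names `bounds`, `lower` and `upper` are made up for this example.

```r
x <- c(1, 5, 3, 12, 5, 6, 7)

# Hand-picked breaks, for illustration only
f <- cut(x, breaks = c(0, 5, 6, 12))
labs <- levels(f)                      # "(0,5]" "(5,6]" "(6,12]"

# Strip the brackets, then split each label at the comma
bounds <- strsplit(gsub("[][()]", "", labs), ",")
lower <- as.numeric(sapply(bounds, `[`, 1))
upper <- as.numeric(sapply(bounds, `[`, 2))

# Indexing by the factor maps every element to its bin's endpoint
lower[f]   # 0 0 0 6 0 5 6
upper[f]   # 5 5 5 12 5 6 12
```

Note that `cut` rounds its labels (3 significant digits by default, see `dig.lab`), so endpoints recovered this way can lose precision.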

Is this just a situation that requires some not-so-elegant hacking? Or, is there some easier functionality to accomplish this?

My current best effort is as follows:

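```r
MyDiscretize = function(x, N_Bins){
  # cut2() returns a factor whose levels are the bin means (as text)
  f = cut2(x, g = N_Bins, levels.mean = TRUE)
  # Convert the levels to numeric, then index by the factor to expand back to length(x)
  return(as.numeric(levels(f))[f])
}
```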

My goal is to find something faster, more elegant, and easily adaptable to use either of the endpoints, rather than just the means.

Edit:

To clarify: my desired output would be:

• (a) an equivalent to what I can achieve right now in the example with `cut2`, but without needing to convert the factor to numeric.

• (b) if possible, the ability to also easily choose to use either of the endpoints of the interval, instead of the midpoint.

Use `ave` like this:

Given:

```r
x = c(1,5,3,12,5,6,7)
```

Mean:

```r
ave(x,cut2(x,g = 3), FUN = mean)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5
```

Min:

```r
ave(x,cut2(x,g = 3), FUN = min)
[1] 1 1 1 7 1 6 7
```

Max:

```r
ave(x,cut2(x,g = 3), FUN = max)
[1]  5  5  5 12  5  6 12
```

Or standard deviation:

```r
ave(x,cut2(x,g = 3), FUN = sd)
[1] 1.914854 1.914854 1.914854 3.535534 1.914854       NA 3.535534
```

Note the NA for the interval that contains only one data point: the standard deviation of a single value is undefined.
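
If you want this as a single reusable function, the calls above could be wrapped roughly as follows. This is only a sketch: `bin_summary` is a hypothetical name, and to keep it free of package dependencies it builds the groups with base R's `quantile()` and `cut()` instead of `cut2`, so bin boundaries may differ from `cut2` on other data. It assumes `x` has no missing values.

```r
# Hypothetical helper (not part of Hmisc or base R):
# replace each element of x by a summary of its quantile bin.
bin_summary <- function(x, n_bins, FUN = mean) {
  # Quantile-based break points; unique() guards against duplicated
  # breaks when x has many ties
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
  # Integer bin codes, no factor labels involved
  bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  # ave() applies FUN within each bin and returns a numeric vector
  ave(x, bin, FUN = FUN)
}

x <- c(1, 5, 3, 12, 5, 6, 7)
bin_summary(x, 3, mean)  # 3.5 3.5 3.5 9.5 3.5 6.0 9.5
bin_summary(x, 3, min)   # 1 1 1 7 1 6 7
bin_summary(x, 3, max)   # 5 5 5 12 5 6 12
```

Passing `min` or `max` gives the smallest or largest observed value in each bin, which is the closest analogue to choosing the bottom or top endpoint.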

Hope this is what you need.

NOTE:
The parameter `g` in `cut2` is the number of quantile groups. The groups might not contain the same number of data points, and the intervals might not have the same length.
On the other hand, `cut` with a single number of breaks splits the range of `x` into that many intervals of equal length.
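
To make the difference concrete, here is a small base R sketch comparing the two kinds of binning on the example vector. The quantile version uses `quantile()` as a stand-in for `cut2`'s grouping, which is an assumption made for illustration; the variable names are also made up.

```r
x <- c(1, 5, 3, 12, 5, 6, 7)

# Equal-width bins: the range of x is split into 3 intervals of equal length
equal_width <- cut(x, 3, labels = FALSE)
equal_width          # 1 2 1 3 2 2 2
table(equal_width)   # bin sizes 2, 4, 1

# Quantile bins: break points at the terciles of x
breaks <- quantile(x, probs = seq(0, 1, length.out = 4))
equal_count <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
equal_count          # 1 1 1 3 1 2 3
table(equal_count)   # bin sizes 4, 1, 2
```

Even the quantile bins are uneven here because the tied value 5 falls on a break point, so both copies land in the same bin.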
