Michael Ohlrogge - 9 months ago 31

R Question

I have a continuous variable that I want to split into bins, returning a numeric vector (of length equal to my original vector) whose values relate to the values of the bins. Each bin should have roughly the same number of elements.

This question, "splitting a continuous variable into equal sized groups", describes a number of techniques for related situations. For instance, if I start with

`x = c(1,5,3,12,5,6,7)`

I can use `cut()`:

`cut(x, 3, labels = FALSE)`

[1] 1 2 1 3 2 2 2

This is undesirable because the returned values are just sequential bin numbers; they have no direct relation to the underlying original values in my vector.

Another possibility is `cut2`:

`library(Hmisc)`

`cut2(x, g = 3, levels.mean = TRUE)`

[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5

This is better because now the return values relate to the values of the bins. It is still less than ideal, though, since:

- (a) it yields a factor, which then needs to be converted to numeric; this is both slow and awkward code-wise.
- (b) ideally I'd like to be able to choose whether to use the top or bottom endpoints of the intervals, instead of just the means.

I know that there are also options that use regex on the factor labels returned from `cut` or `cut2`.

Is this just a situation that requires some not-so-elegant hacking? Or, is there some easier functionality to accomplish this?

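What I have right now is:

```
library(Hmisc)  # provides cut2()

MyDiscretize = function(x, N_Bins){
  # Bin x into N_Bins quantile groups; each level is labelled with its bin mean
  f = cut2(x, g = N_Bins, levels.mean = TRUE)
  # Map each element to its level label and convert that label to numeric
  return(as.numeric(levels(f))[f])
}
```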
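For anyone wondering why the conversion goes through `levels(f)`: calling `as.numeric()` directly on a factor returns the internal level codes, not the labels. Here is a minimal illustration, using a hand-built factor as a stand-in for the `cut2` result (the values are made up to mirror the example above):

```r
# Hand-built factor standing in for a cut2() result (illustrative only)
f <- factor(c("3.5", "3.5", "9.5", "6"))

codes  <- as.numeric(f)             # internal level codes: 1 1 3 2
values <- as.numeric(levels(f))[f]  # label values: 3.5 3.5 9.5 6.0

print(codes)
print(values)
```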

My goal is to find something faster, more elegant, and easily adaptable to use either of the endpoints, rather than just the means.

To clarify: my desired output would be:

- (a) an equivalent to what I can achieve right now in the example with `cut2`, but without needing to convert the factor to numeric.
- (b) if possible, the ability to also easily choose to use either of the endpoints of the interval, instead of the mean.

Answer

Use `ave` like this:

Given:

```
x = c(1,5,3,12,5,6,7)
```

Mean:

```
ave(x,cut2(x,g = 3), FUN = mean)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5
```

Min:

```
ave(x,cut2(x,g = 3), FUN = min)
[1] 1 1 1 7 1 6 7
```

Max:

```
ave(x,cut2(x,g = 3), FUN = max)
[1] 5 5 5 12 5 6 12
```

Or standard deviation:

```
ave(x,cut2(x,g = 3), FUN = sd)
[1] 1.914854 1.914854 1.914854 3.535534 1.914854 NA 3.535534
```

Note the `NA` result for the interval that contains only one data point: the sample standard deviation is undefined for a single value.
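If you want this pattern as a single reusable function, here is a minimal sketch. Note that it is not the `cut2` call used above: it stays in base R and takes its breaks from `quantile()`, which is my assumption and may group values differently from `cut2` on other data. The name `bin_summary` and its arguments are made up for illustration.

```r
# Hypothetical helper: replace each value by a summary of its quantile bin.
# Assumption: breaks come from base R quantile(), not from Hmisc::cut2().
bin_summary <- function(x, n_bins, FUN = mean) {
  probs  <- seq(0, 1, length.out = n_bins + 1)
  # unique() guards against duplicated breaks when x has many ties
  breaks <- unique(quantile(x, probs = probs, names = FALSE))
  bin    <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  # ave() returns a numeric vector of the same length as x
  ave(x, bin, FUN = FUN)
}

x <- c(1, 5, 3, 12, 5, 6, 7)
bin_summary(x, 3, mean)  # bin means
bin_summary(x, 3, min)   # smallest observed value in each bin
bin_summary(x, 3, max)   # largest observed value in each bin
```

On this `x` the quantile bins coincide with the `cut2` groups, so the three calls return the same vectors as the `ave` examples above. The result is already numeric, so no factor conversion is needed.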

Hope this is what you need.

NOTE:

The parameter `g` in `cut2` is the number of quantile groups. Groups might not contain the same number of data points, and the intervals might not have the same length.

On the other hand, `cut` splits the range of the data into intervals of equal length.
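To see the difference concretely, here is a small base-R sketch on the same `x`. It uses `quantile()` breaks as a stand-in for `cut2` (an assumption on my part; `cut2` may place its cuts differently), and the variable names are just for illustration.

```r
x <- c(1, 5, 3, 12, 5, 6, 7)

# Equal-length intervals: cut() splits the range of x into 3 pieces
equal_width <- cut(x, 3, labels = FALSE)

# Quantile-based intervals (stand-in for cut2; an assumption, not cut2 itself)
q_breaks    <- quantile(x, probs = seq(0, 1, length.out = 4), names = FALSE)
equal_count <- cut(x, breaks = q_breaks, include.lowest = TRUE, labels = FALSE)

table(equal_width)  # number of points in each equal-length bin
table(equal_count)  # number of points in each quantile bin
diff(q_breaks)      # lengths of the quantile intervals
```

Here the equal-length bins hold 2, 4 and 1 points. The quantile bins hold 4, 1 and 2 points (the tie at 5 keeps them from being balanced) and have lengths 4, 1 and 6.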

Source (Stackoverflow)