Richard Herron -4 years ago 95
R Question

# How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame

I see a lot of questions and answers re

`order`
and
`sort`
. Is there anything that sorts vectors or data frames into groupings (like quartiles or deciles)? I have a "manual" solution, but there's likely a better solution that has been group-tested.

Here's my attempt:

``````temp <- data.frame(name=letters[1:12], value=rnorm(12), quartile=rep(NA, 12))
temp
#    name       value quartile
# 1     a  2.55118169       NA
# 2     b  0.79755259       NA
# 3     c  0.16918905       NA
# 4     d  1.73359245       NA
# 5     e  0.41027113       NA
# 6     f  0.73012966       NA
# 7     g -1.35901658       NA
# 8     h -0.80591167       NA
# 9     i  0.48966739       NA
# 10    j  0.88856758       NA
# 11    k  0.05146856       NA
# 12    l -0.12310229       NA
temp.sorted <- temp[order(temp\$value), ]
temp.sorted\$quartile <- rep(1:4, each=12/4)
temp <- temp.sorted[order(as.numeric(rownames(temp.sorted))), ]
temp
#    name       value quartile
# 1     a  2.55118169        4
# 2     b  0.79755259        3
# 3     c  0.16918905        2
# 4     d  1.73359245        4
# 5     e  0.41027113        2
# 6     f  0.73012966        3
# 7     g -1.35901658        1
# 8     h -0.80591167        1
# 9     i  0.48966739        3
# 10    j  0.88856758        4
# 11    k  0.05146856        2
# 12    l -0.12310229        1
``````

Is there a better (cleaner/faster/one-line) approach? Thanks!

The method I use is one of these or `Hmisc::cut2(value, g=4)`:

``````temp\$quartile <- with(temp, cut(value,
breaks=quantile(value, probs=seq(0,1, by=0.25), na.rm=TRUE),
include.lowest=TRUE))
``````

An alternate might be:

``````temp\$quartile <- with(temp, factor(
findInterval( val, c(-Inf,
quantile(val, probs=c(0.25, .5, .75)), Inf) , na.rm=TRUE),
labels=c("Q1","Q2","Q3","Q4")
))
``````

The first one has the side-effect of labeling the quartiles with the values, which I consider a "good thing", but if it were not "good for you", or the valid problems raised in the comments were a concern you could go with version 2. You can use `labels=` in `cut`, or you could add this line to your code:

``````temp\$quartile <- factor(temp\$quartile, levels=c("1","2","3","4") )
``````

Or even quicker but slightly more obscure in how it works, although it is no longer a factor, but rather a numeric vector:

``````temp\$quartile <- as.numeric(temp\$quartile)
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download