Richard Herron Richard Herron - 3 months ago 12
R Question

How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame

I see a lot of questions and answers re

order
and
sort
. Is there anything that sorts vectors or data frames into groupings (like quartiles or deciles)? I have a "manual" solution, but there's likely a better solution that has been group-tested.

Here's my attempt:

temp <- data.frame(name=letters[1:12], value=rnorm(12), quartile=rep(NA, 12))
temp
# name value quartile
# 1 a 2.55118169 NA
# 2 b 0.79755259 NA
# 3 c 0.16918905 NA
# 4 d 1.73359245 NA
# 5 e 0.41027113 NA
# 6 f 0.73012966 NA
# 7 g -1.35901658 NA
# 8 h -0.80591167 NA
# 9 i 0.48966739 NA
# 10 j 0.88856758 NA
# 11 k 0.05146856 NA
# 12 l -0.12310229 NA
temp.sorted <- temp[order(temp$value), ]
temp.sorted$quartile <- rep(1:4, each=12/4)
temp <- temp.sorted[order(as.numeric(rownames(temp.sorted))), ]
temp
# name value quartile
# 1 a 2.55118169 4
# 2 b 0.79755259 3
# 3 c 0.16918905 2
# 4 d 1.73359245 4
# 5 e 0.41027113 2
# 6 f 0.73012966 3
# 7 g -1.35901658 1
# 8 h -0.80591167 1
# 9 i 0.48966739 3
# 10 j 0.88856758 4
# 11 k 0.05146856 2
# 12 l -0.12310229 1


Is there a better (cleaner/faster/one-line) approach? Thanks!

42- 42-
Answer

The method I use is one of these or Hmisc::cut2(value, g=4):

temp$quartile <- with(temp, cut(value, 
                                breaks=quantile(value, probs=seq(0,1, by=0.25), na.rm=TRUE), 
                                include.lowest=TRUE))

An alternate might be:

temp$quartile <- with(temp, factor(
                            findInterval( val, c(-Inf,
                               quantile(val, probs=c(0.25, .5, .75)), Inf) , na.rm=TRUE), 
                            labels=c("Q1","Q2","Q3","Q4")
      ))

The first one has the side-effect of labeling the quartiles with the values, which I consider a "good thing", but if it were not "good for you", or the valid problems raised in the comments were a concern you could go with version 2. You can use labels= in cut, or you could add this line to your code:

temp$quartile <- factor(temp$quartile, levels=c("1","2","3","4") )

Or even quicker but slightly more obscure in how it works, although it is no longer a factor, but rather a numeric vector:

temp$quartile <- as.numeric(temp$quartile)