John John - 1 month ago 26
R Question

Create groups based on percent_rank in dplyr

I am trying to create some groups based on the percent rank of some values in

dplyr
.

The code below creates a data frame and then
sapply
a function to determine the groups. The downside is that I can't get sapply to work for
tbl_postgres
, only data frames. So I'm curious if there is another solution for this.

I had considered something with ntile, but the groups I want to create have some arbitrary cut-offs. Also, I have not had much luck getting it to work with
dplyr
(maybe pure sql might work).

library(dplyr)

n <- 100

df1 <- data.frame(idx = 1:n, x = rnorm(n))

df1 <- df1 %>%
arrange(x) %>%
mutate(pc_x = percent_rank(x))

index <- function(x) {
if (x < 0) {
return(NA)
} else if (x < 0.3) {
return(1)
} else if (x < 0.7) {
return(2)
} else if (x <= 1) {
return(3)
} else {
return(NA)
}
}

df1 <- df1 %>%
mutate(group = sapply(pc_x, index))

Answer

Perhaps cut will serve your needs:

library(dplyr)
n <- 100
set.seed(42)
df1 <- data.frame(idx = 1:n, x = rnorm(n))
df1 <- df1 %>%
    arrange(x) %>%
    mutate(pc_x = percent_rank(x))

I use -1e9 in breaks because cut is "left-open", so if I used breaks <- c(0, ...) then the first row would be NA instead of 1.

breaks <- c(-1e9, 0.3, 0.7, 1)
df1 %>%
    mutate(grp = cut(pc_x, breaks=breaks, labels=FALSE)) %>%
    group_by(grp)
## Source: local data frame [100 x 4]
## Groups: grp [3]
##      idx          x       pc_x   grp
##    (int)      (dbl)      (dbl) (int)
## 1     59 -2.9930901 0.00000000     1
## 2     18 -2.6564554 0.01010101     1
## 3     19 -2.4404669 0.02020202     1
## 4     39 -2.4142076 0.03030303     1
## 5     22 -1.7813084 0.04040404     1
## ..   ...        ...        ...   ...
Comments