John - 5 months ago 56

R Question

I am trying to create some groups based on the percent rank of some values in

`dplyr`

The code below creates a data frame and then

`sapply`

`tbl_postgres`

I had considered something with ntile, but the groups I want to create have some arbitrary cut-offs. Also, I have not had much luck getting it to work with

`dplyr`

`library(dplyr)`

n <- 100

df1 <- data.frame(idx = 1:n, x = rnorm(n))

df1 <- df1 %>%

arrange(x) %>%

mutate(pc_x = percent_rank(x))

index <- function(x) {

if (x < 0) {

return(NA)

} else if (x < 0.3) {

return(1)

} else if (x < 0.7) {

return(2)

} else if (x <= 1) {

return(3)

} else {

return(NA)

}

}

df1 <- df1 %>%

mutate(group = sapply(pc_x, index))

Answer

Perhaps `cut`

will serve your needs:

```
library(dplyr)
n <- 100
set.seed(42)
df1 <- data.frame(idx = 1:n, x = rnorm(n))
df1 <- df1 %>%
arrange(x) %>%
mutate(pc_x = percent_rank(x))
```

I use `-1e9`

in `breaks`

because `cut`

is "left-open", so if I used `breaks <- c(0, ...)`

then the first row would be `NA`

instead of 1.

```
breaks <- c(-1e9, 0.3, 0.7, 1)
df1 %>%
mutate(grp = cut(pc_x, breaks=breaks, labels=FALSE)) %>%
group_by(grp)
## Source: local data frame [100 x 4]
## Groups: grp [3]
## idx x pc_x grp
## (int) (dbl) (dbl) (int)
## 1 59 -2.9930901 0.00000000 1
## 2 18 -2.6564554 0.01010101 1
## 3 19 -2.4404669 0.02020202 1
## 4 39 -2.4142076 0.03030303 1
## 5 22 -1.7813084 0.04040404 1
## .. ... ... ... ...
```