st19297 - 6 months ago 25

R Question

Imagine we have a data frame like this:

`df<-data.frame(x=seq(10,20), y=seq(8,18), z=seq(0,10))`

        x  y  z
    1  10  8  0
    2  11  9  1
    3  12 10  2
    4  13 11  3
    5  14 12  4
    6  15 13  5
    7  16 14  6
    8  17 15  7
    9  18 16  8
    10 19 17  9
    11 20 18 10

How can we select the cases that are in the highest percentile on all of x, y and z? I need code that searches for cases in the top 1% on all variables, then, if it finds nothing, loosens the criterion to 2%, then 3%, and so on, until it finds m cases that are in the highest percentile on all the variables. We want to be able to set m ourselves.

Answer

I think this should do the trick for you:

```
df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))
# df is the input data frame and cases is the 'm' you are looking for.
# startingperc is the quantile level to start from (e.g. .99 for the top 1%),
# and tickrate is how much that level is lowered on each pass until m cases are found.
myfunc <- function(df, cases, startingperc, tickrate) {
  found <- 0
  while (found < cases) {
    # stop rather than call quantile() with a negative probability
    if (startingperc < 0) stop("fewer than 'cases' rows exceed every column's cutoff")
    # per-column cutoff at the current quantile level
    quants <- apply(df, 2, quantile, probs = startingperc)
    # rows that are strictly above the cutoff in every column
    indices <- which(apply(df, 1, function(x) all(x > quants)))
    found <- length(indices)
    if (found < cases) startingperc <- startingperc - tickrate
  }
  # a tickrate that is too large can overshoot; keep only the 'cases' rows
  # with the largest row sums
  if (length(indices) > cases) {
    indices <- rev(indices[order(apply(df[indices, ], 1, sum), decreasing = TRUE)[1:cases]])
  }
  return(df[indices, ])
}
# in use
myfunc(df, 5, .99, .01)
```

Giving:

```
> myfunc(df, 5, .99, .01)
    x  y  z
7  16 14  6
8  17 15  7
9  18 16  8
10 19 17  9
11 20 18 10
```
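
If you would rather avoid the loop, here is a rough sketch of a related (not identical) idea: a row clears a cutoff on every variable only if its weakest variable clears it, so you can rank rows by their lowest within-column percentile rank and take the top m. The function name `top_on_all` is made up for this illustration, and it assumes all columns are numeric and the data frame has more than one row. Ties are handled differently from `myfunc` (earlier rows win here, rather than the largest row sums), so results can differ on data with ties.

```r
# Hypothetical helper, not part of the answer above:
# rank rows by their weakest column and keep the best m.
top_on_all <- function(df, m) {
  # percentile rank of each value within its own column (ties share the highest rank)
  pr <- sapply(df, function(col) rank(col, ties.method = "max") / length(col))
  # a row's limiting factor is its lowest percentile rank across columns
  worst <- apply(pr, 1, min)
  # indices of the m rows whose weakest column ranks highest, in original row order
  keep <- sort(order(worst, decreasing = TRUE)[seq_len(min(m, nrow(df)))])
  df[keep, , drop = FALSE]
}

df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))
top_on_all(df, 5)
```

On the example data this picks rows 7 to 11, the same rows as `myfunc(df, 5, .99, .01)`.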