st19297 - 24 days ago
R Question

How to find the cases in the top n percentile of several variables at the same time?

Imagine we have a data frame like this:

df<-data.frame(x=seq(10,20), y=seq(8,18), z=seq(0,10))

    x  y  z
1  10  8  0
2  11  9  1
3  12 10  2
4  13 11  3
5  14 12  4
6  15 13  5
7  16 14  6
8  17 15  7
9  18 16  8
10 19 17  9
11 20 18 10


How can we select the cases that are in the HIGHEST percentile on all of X, Y and Z? I need code that searches for cases in the top 1% on all variables, then, if it finds too few, loosens the criterion to 2%, then 3%, and so on until it finds m cases that are in the highest percentile on all the variables. We need to be able to set m as we desire.
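To make the criterion concrete, here is a minimal sketch of the check at one fixed level. It assumes that "top p percent" means strictly above the (1 - p) quantile of every column; the level 0.45 and the names p, cutoffs and in_top are chosen only for illustration.

```r
df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))

p <- 0.45                                        # illustrative level: top 45%
cutoffs <- sapply(df, quantile, probs = 1 - p)   # one cutoff per column

# TRUE for rows that are strictly above the cutoff on every column
in_top <- apply(df, 1, function(row) all(row > cutoffs))

df[in_top, ]
```

The question then amounts to repeating this check while increasing p until at least m rows pass.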

Answer

I think this should do the trick for you:

df<-data.frame(x=seq(10,20), y=seq(8,18), z=seq(0,10))

#defining function - df is the input data frame, cases is the 'm' you are looking for,
#startingperc is the quantile level you want to start with (e.g. .99 for the top 1%)
#and tickrate is the amount by which that level is lowered on each pass until you get m cases
myfunc <- function(df, cases, startingperc, tickrate){
  found <- 0
  while(found < cases) {
    quants <- apply(df, 2, quantile, probs = startingperc)
    indices <- which(apply(df, 1, function(x) all(x > quants)))
    found <- length(indices)
    if(found < cases) {startingperc <- startingperc - tickrate}
  }
  #added this to handle a tickrate that is too large: if the last step overshoots,
  #keep only the 'cases' rows with the largest row sums
  if (length(indices) > cases) {
    indices <- rev(indices[order(rowSums(df[indices,]), decreasing = TRUE)[1:cases]])
  }
  return(df[indices,])
}

#in use
myfunc(df, 5, .99, .01)

Giving:

> myfunc(df, 5, .99, .01)
    x  y  z
7  16 14  6
8  17 15  7
9  18 16  8
10 19 17  9
11 20 18 10
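
As a possible alternative to lowering the quantile step by step, the same idea can be sketched with ranks: a row's weakest column decides how far the threshold has to drop before that row is admitted, so ranking rows by their weakest column gives the m rows directly. This is a different technique from the loop above, not a rewrite of it, and it may disagree with the loop when there are ties. The helper name top_on_all is hypothetical, and the sketch assumes all columns are numeric.

```r
# Hypothetical helper: rank-based alternative to the quantile loop
top_on_all <- function(df, m) {
  # rank of each value within its column (1 = smallest; ties get the average rank)
  ranks <- do.call(cbind, lapply(df, rank))
  # the weakest (lowest) rank of each row across all columns
  worst <- apply(ranks, 1, min)
  # keep the m rows whose weakest rank is highest, in original row order
  m <- min(m, nrow(df))
  keep <- sort(order(worst, decreasing = TRUE)[seq_len(m)])
  df[keep, , drop = FALSE]
}

df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))
top_on_all(df, 5)
```

For the example data this returns rows 7 to 11, the same rows as myfunc(df, 5, .99, .01).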