st19297 - 1 year ago 48
R Question

# How to find the cases in the top n percentile of several variables at the same time?

Imagine we have a data frame like this:

df<-data.frame(x=seq(10,20), y=seq(8,18), z=seq(0,10))

    x  y  z
1  10  8  0
2  11  9  1
3  12 10  2
4  13 11  3
5  14 12  4
6  15 13  5
7  16 14  6
8  17 15  7
9  18 16  8
10 19 17  9
11 20 18 10

How can we select the cases that are in the highest percentile on all of x, y and z? I need code that searches for cases in the top 1% on all variables and, if it finds nothing, loosens the criterion to 2%, then 3%, and so on until it finds m cases that are in the highest percentile on all the variables. We need to be able to set m as desired.

I think this should do the trick for you:

df<-data.frame(x=seq(10,20), y=seq(8,18), z=seq(0,10))

# Defining the function: df is the input frame, cases is the 'm' you are looking for,
# startingperc is the percentile level you want to start with, and tickrate
# is the rate at which you decrease the percentile until you get m cases
myfunc <- function(df, cases, startingperc, tickrate) {
  found <- 0
  while (found < cases) {
    # per-column threshold at the current percentile
    quants <- apply(df, 2, quantile, probs = startingperc)
    # rows that exceed the threshold on every column
    indices <- which(apply(df, 1, function(x) all(x > quants)))
    found <- length(indices)
    if (found < cases) {
      # nothing left to loosen: quantile() rejects probabilities below 0
      if (startingperc <= 0) stop("fewer than 'cases' rows exceed the minimum on every variable")
      startingperc <- max(startingperc - tickrate, 0)
    }
  }
  # added this to handle a tickrate that is too large:
  # keep only the 'cases' rows with the largest row sums
  if (length(indices) > cases) {
    indices <- rev(indices[order(apply(df[indices, ], 1, sum), decreasing = TRUE)[1:cases]])
  }
  return(df[indices, ])
}

#in use
myfunc(df, 5, .99, .01)

Giving:

> myfunc(df, 5, .99, .01)
    x  y  z
7  16 14  6
8  17 15  7
9  18 16  8
10 19 17  9
11 20 18 10
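
For comparison, here is a minimal sketch of a loop-free variant. It is not part of the answer above, and `top_m_cases` is a hypothetical name. Loosening the cutoff step by step surfaces rows in order of their weakest variable, so it instead ranks each value within its column, takes each row's lowest percentile rank, and keeps the m rows where that lowest rank is highest. It assumes no missing values. On tied data, or compared with a coarse `tickrate`, it may pick different rows than `myfunc`, which breaks ties by row sum.

```r
# Hypothetical helper, shown only as an illustration of the idea.
top_m_cases <- function(df, m) {
  # Percentile rank (in (0, 1]) of every value within its own column
  pct <- lapply(df, function(col) rank(col, ties.method = "min") / length(col))
  # A row is only as "high" as its weakest variable
  weakest <- do.call(pmin, unname(pct))
  # Positions of the m rows whose weakest percentile rank is highest
  keep <- order(weakest, decreasing = TRUE)[seq_len(min(m, nrow(df)))]
  # Return them in their original row order
  df[sort(keep), , drop = FALSE]
}

df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))
result <- top_m_cases(df, 5)
print(result)
```

On the example data this selects rows 7 to 11, the same rows as `myfunc(df, 5, .99, .01)`.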