Alex Alex - 3 months ago 13
R Question

Calculate the percentile rank of a set of scores based on a population

I have a set of scores on 31 competencies for one individual

data <- data.frame(scores=sample(100,31,replace=T),row.names=paste0("X",1:31))


and I have a benchmark file for scores of 100 (or any number) individuals on the same 31 competencies

bmark <- data.frame(replicate(31,sample(0:100,100,rep=TRUE)))


I need to calculate the percentile rank of each competency score for the individual relative to the benchmark for that competency, the output should look something like this (random values, just to demonstrate the required format)

data <- data.frame(scores=sample(100,31,replace=T),percentile=sample(100,31,replace=T),row.names=paste0("X",1:31))


I can calculate each score's percentile rank using a basic percentile formula:

length(bmark$X31[bmark$X31<data$scores[rownames(data)=="X31"]])/length(bmark$X31)*100


But I don't know how to make this work across the whole data set at once, so that each row in
data$percentile
has the appropriate value for the corresponding competency in the benchmark file. I could do this with a loop, but still not great with R vectorisation.

Answer

I was a bit confused by your question, because data has scores ranging from 1 to 100, but bmark is all zeros and ones. It seemed like each column of bmark should have scores from 1 to 100. If I'm wrong, please let me know.

Here's a way to get the percentile on each benchmark for one subject. We use the empirical cumulative distribution function to calculate percentiles:

# Fake data
set.seed(595)
dat <- data.frame(scores=sample(100,31,replace=T),row.names=paste0("X",1:31))
bmark <- data.frame(replicate(31,sample(100,100,rep=TRUE)))

# Get percentile on each benchmark for one subject
dat$percentile = mapply(function(ref, subj) {
  ecdf(ref)(subj)*100
}, ref=bmark, subj=dat$scores)

dat
    scores percentile
X1      28         25
X2      25         30
X3      91         92
...
X29     42         46
X30     76         71
X31      1          2

Here's a boxplot of the distribution of bmark for each competency, along with a red dot showing where the subject scored on each of them:

boxplot(bmark)
points(1:31, data$scores, pch=16, col="red")

enter image description here

If you have multiple subjects, you can get their percentiles all at once. We take the mapply code from above, which calculates percentiles for a single subject, and wrap it in sapply, which feeds the mapply code each subject in succession and returns all of the results in a single matrix:

# Scores on 31 benchmarks for 20 subjects. Each column is a subject.
set.seed(58)
subjects = as.data.frame(replicate(20, sample(100, 31, replace=TRUE)))

# Get percentile on each benchmark for each subject
percentile.score = sapply(subjects, function(s) {
  mapply(function(ref, subj) {
    ecdf(ref)(subj)*100
  }, ref=bmark, subj=s)
})

percentile.score
    V1 V2 V3  V4 V5  V6 V7  V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
X1  28 95  2  24 95  18 61   6 44  11  82  50  28  37  59  41  91  51  35  76
X2  20 49 62   3 24  24 54  23 50  50  60  95  25  63  57  78  72  89  79  66
X3  69 37 40  49 59  30 91   5 92  43  56  22  34  33  28  54  35   1  59  29
...
X29 34 42 10  76 40  48 13  36 76  13  88  91  99  31  13  76  93  42  31  91
X30 16 66 86  56 21  67 86  45 81  16  70  66  24  11  23  18  32  53  31  32
X31 81 81 52   2  3  32 64  36 33  39  92 100  80   2  44  63  59   2  34  99