Alex - 9 months ago 34
R Question

# Calculate the percentile rank of a set of scores based on a population

I have a set of scores on 31 competencies for one individual

``````data <- data.frame(scores=sample(100,31,replace=T),row.names=paste0("X",1:31))
``````

and I have a benchmark file for scores of 100 (or any number) individuals on the same 31 competencies

``````bmark <- data.frame(replicate(31,sample(0:100,100,rep=TRUE)))
``````

I need to calculate the percentile rank of each competency score for the individual relative to the benchmark for that competency, the output should look something like this (random values, just to demonstrate the required format)

``````data <- data.frame(scores=sample(100,31,replace=T),percentile=sample(100,31,replace=T),row.names=paste0("X",1:31))
``````

I can calculate each score's percentile rank using a basic percentile formula:

``````length(bmark\$X31[bmark\$X31<data\$scores[rownames(data)=="X31"]])/length(bmark\$X31)*100
``````

But I don't know how to make this work across the whole data set at once, so that each row in
`data\$percentile`
has the appropriate value for the corresponding competency in the benchmark file. I could do this with a loop, but still not great with R vectorisation.

I was a bit confused by your question, because `data` has scores ranging from 1 to 100, but `bmark` is all zeros and ones. It seemed like each column of `bmark` should have scores from 1 to 100. If I'm wrong, please let me know.

Here's a way to get the percentile on each benchmark for one subject. We use the empirical cumulative distribution function to calculate percentiles:

``````# Fake data
set.seed(595)
dat <- data.frame(scores=sample(100,31,replace=T),row.names=paste0("X",1:31))
bmark <- data.frame(replicate(31,sample(100,100,rep=TRUE)))

# Get percentile on each benchmark for one subject
dat\$percentile = mapply(function(ref, subj) {
ecdf(ref)(subj)*100
}, ref=bmark, subj=dat\$scores)

dat
``````
``````    scores percentile
X1      28         25
X2      25         30
X3      91         92
...
X29     42         46
X30     76         71
X31      1          2
``````

Here's a boxplot of the distribution of `bmark` for each competency, along with a red dot showing where the subject scored on each of them:

``````boxplot(bmark)
points(1:31, data\$scores, pch=16, col="red")
``````

If you have multiple subjects, you can get their percentiles all at once. We take the `mapply` code from above, which calculates percentiles for a single subject, and wrap it in `sapply`, which feeds the `mapply` code each subject in succession and returns all of the results in a single matrix:

``````# Scores on 31 benchmarks for 20 subjects. Each column is a subject.
set.seed(58)
subjects = as.data.frame(replicate(20, sample(100, 31, replace=TRUE)))

# Get percentile on each benchmark for each subject
percentile.score = sapply(subjects, function(s) {
mapply(function(ref, subj) {
ecdf(ref)(subj)*100
}, ref=bmark, subj=s)
})

percentile.score
``````
``````    V1 V2 V3  V4 V5  V6 V7  V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
X1  28 95  2  24 95  18 61   6 44  11  82  50  28  37  59  41  91  51  35  76
X2  20 49 62   3 24  24 54  23 50  50  60  95  25  63  57  78  72  89  79  66
X3  69 37 40  49 59  30 91   5 92  43  56  22  34  33  28  54  35   1  59  29
...
X29 34 42 10  76 40  48 13  36 76  13  88  91  99  31  13  76  93  42  31  91
X30 16 66 86  56 21  67 86  45 81  16  70  66  24  11  23  18  32  53  31  32
X31 81 81 52   2  3  32 64  36 33  39  92 100  80   2  44  63  59   2  34  99
``````