Alex - 9 months ago 34

R Question

I have a set of scores on 31 competencies for one individual

`data <- data.frame(scores=sample(100,31,replace=T),row.names=paste0("X",1:31))`

and I have a benchmark file for scores of 100 (or any number) individuals on the same 31 competencies

`bmark <- data.frame(replicate(31,sample(0:100,100,rep=TRUE)))`

I need to calculate the percentile rank of each competency score for the individual relative to the benchmark for that competency, the output should look something like this (random values, just to demonstrate the required format)

`data <- data.frame(scores=sample(100,31,replace=T),percentile=sample(100,31,replace=T),row.names=paste0("X",1:31))`

I can calculate each score's percentile rank using a basic percentile formula:

`length(bmark$X31[bmark$X31<data$scores[rownames(data)=="X31"]])/length(bmark$X31)*100`

But I don't know how to make this work across the whole data set at once, so that each row in

`data$percentile`

Answer

I was a bit confused by your question, because `data`

has scores ranging from 1 to 100, but `bmark`

is all zeros and ones. It seemed like each column of `bmark`

should have scores from 1 to 100. If I'm wrong, please let me know.

Here's a way to get the percentile on each benchmark for one subject. We use the empirical cumulative distribution function to calculate percentiles:

```
# Fake data
set.seed(595)
dat <- data.frame(scores=sample(100,31,replace=T),row.names=paste0("X",1:31))
bmark <- data.frame(replicate(31,sample(100,100,rep=TRUE)))
# Get percentile on each benchmark for one subject
dat$percentile = mapply(function(ref, subj) {
ecdf(ref)(subj)*100
}, ref=bmark, subj=dat$scores)
dat
```

`scores percentile X1 28 25 X2 25 30 X3 91 92 ... X29 42 46 X30 76 71 X31 1 2`

Here's a boxplot of the distribution of `bmark`

for each competency, along with a red dot showing where the subject scored on each of them:

```
boxplot(bmark)
points(1:31, data$scores, pch=16, col="red")
```

If you have multiple subjects, you can get their percentiles all at once. We take the `mapply`

code from above, which calculates percentiles for a single subject, and wrap it in `sapply`

, which feeds the `mapply`

code each subject in succession and returns all of the results in a single matrix:

```
# Scores on 31 benchmarks for 20 subjects. Each column is a subject.
set.seed(58)
subjects = as.data.frame(replicate(20, sample(100, 31, replace=TRUE)))
# Get percentile on each benchmark for each subject
percentile.score = sapply(subjects, function(s) {
mapply(function(ref, subj) {
ecdf(ref)(subj)*100
}, ref=bmark, subj=s)
})
percentile.score
```

`V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 X1 28 95 2 24 95 18 61 6 44 11 82 50 28 37 59 41 91 51 35 76 X2 20 49 62 3 24 24 54 23 50 50 60 95 25 63 57 78 72 89 79 66 X3 69 37 40 49 59 30 91 5 92 43 56 22 34 33 28 54 35 1 59 29 ... X29 34 42 10 76 40 48 13 36 76 13 88 91 99 31 13 76 93 42 31 91 X30 16 66 86 56 21 67 86 45 81 16 70 66 24 11 23 18 32 53 31 32 X31 81 81 52 2 3 32 64 36 33 39 92 100 80 2 44 63 59 2 34 99`

Source (Stackoverflow)