asuka - 2 months ago
R Question

R and Random Forest: How caret and pROC deal with positive and negative class?

Over the past few days, I've been analyzing the performance of R's implementation of Random Forest and the different tools available for obtaining:


  • AUC

  • Sensitivity

  • Specificity



To do so, I've used two different methods:


  • roc and coords from the pROC library, to obtain the performance of the model at different cutoff points.

  • confusionMatrix from the caret library, to obtain the overall performance of the model (Accuracy, Sensitivity, Specificity, ...)



The point is that I've noticed some differences between the two approaches.

I've developed the following code:

suppressMessages(library(randomForest))
suppressMessages(library(pROC))
suppressMessages(library(caret))

set.seed(100)

# Training data: 10 observations, 10 predictors, binary outcome
t_x <- as.data.frame(matrix(runif(100), ncol = 10))
t_y <- factor(sample(c("A", "B"), 10, replace = TRUE), levels = c("A", "B"))

# Validation data: 5 observations
v_x <- as.data.frame(matrix(runif(50), ncol = 10))
v_y <- factor(sample(c("A", "B"), 5, replace = TRUE), levels = c("A", "B"))

model <- randomForest(t_x, t_y, ntree = 1000, importance = TRUE)
prob.out <- predict(model, v_x, type = "prob")[, 1]       # probability of class "A"
prediction.out <- predict(model, v_x, type = "response")  # predicted class labels

mroc <- roc(v_y, prob.out, plot = FALSE)

# Sensitivity, specificity, PPV and NPV at cutoffs 0, 0.01, ..., 1
results <- coords(mroc, seq(0, 1, by = 0.01), input = "threshold",
                  ret = c("sensitivity", "specificity", "ppv", "npv"))

accuracyData <- confusionMatrix(prediction.out, v_y)


If you compare the results and accuracyData variables, you can see that sensitivity and specificity are swapped between the two.

That is, the confusionMatrix results are:

Confusion Matrix and Statistics

Reference
Prediction A B
A 1 1
B 2 1

Accuracy : 0.4
95% CI : (0.0527, 0.8534)
No Information Rate : 0.6
P-Value [Acc > NIR] : 0.913

Kappa : -0.1538
Mcnemar's Test P-Value : 1.000

Sensitivity : 0.3333
Specificity : 0.5000
Pos Pred Value : 0.5000
Neg Pred Value : 0.3333
Prevalence : 0.6000
Detection Rate : 0.2000
Detection Prevalence : 0.4000
Balanced Accuracy : 0.4167

'Positive' Class : A
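
As a sanity check, these two numbers follow directly from the 2×2 table above with "A" as the positive class (a quick base-R computation):

```r
# Confusion matrix from above: rows = prediction, columns = reference
tab <- matrix(c(1, 2,    # reference A: predicted A = 1, predicted B = 2
                1, 1),   # reference B: predicted A = 1, predicted B = 1
              nrow = 2,
              dimnames = list(Prediction = c("A", "B"),
                              Reference  = c("A", "B")))

sensitivity <- tab["A", "A"] / sum(tab[, "A"])  # 1/3 ≈ 0.3333
specificity <- tab["B", "B"] / sum(tab[, "B"])  # 1/2 = 0.5
```
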


But if I look for such Sensitivity and Specificity in the coords calculation, I find them swapped:

threshold sensitivity specificity       ppv       npv
     0.32         0.5   0.3333333 0.3333333 0.5000000


Apparently, Sensitivity and Specificity are swapped between coords and confusionMatrix.

Given that confusionMatrix correctly identifies the positive class, I assume its interpretation of Sensitivity and Specificity is the right one.

My question is: Is there any way to force coords to interpret the positive and negative classes the way I want?

Answer

If you look at the output of confusionMatrix, you can see this:

       'Positive' Class : A 

Now looking at mroc, class B is taken as the positive class:

Data: prob.out in 3 controls (v_y A) < 2 cases (v_y B).

Basically, pROC takes the levels of your factor in the order controls (negative), cases (positive), while caret takes the first level as the positive class — the exact opposite. You can pass the levels explicitly to pROC to get the same behaviour:

mroc <- roc(v_y,prob.out,plot=F, levels = c("B", "A"))
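
To see the effect on a self-contained toy example (made-up response and probabilities, not the question's random-forest output), swapping the levels argument exchanges the roles that coords assigns to sensitivity and specificity at a given threshold:

```r
library(pROC)

# Hypothetical toy data, chosen so sensitivity != specificity at cutoff 0.5
resp   <- factor(c("A", "A", "A", "B", "B"), levels = c("A", "B"))
prob_A <- c(0.9, 0.7, 0.4, 0.6, 0.1)  # predicted probability of class "A"

# Default: first level "A" = controls (negative), "B" = cases (positive)
r_default <- roc(resp, prob_A, quiet = TRUE)

# Swapped: "B" = controls, "A" = cases, matching caret's positive class "A"
r_swapped <- roc(resp, prob_A, levels = c("B", "A"), quiet = TRUE)

coords(r_default, 0.5, input = "threshold", ret = c("sensitivity", "specificity"))
coords(r_swapped, 0.5, input = "threshold", ret = c("sensitivity", "specificity"))
# The two calls report sensitivity and specificity with their roles exchanged
```

Note that the AUC is unaffected by the swap (pROC auto-detects the direction of comparison); only the labelling of sensitivity versus specificity changes.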