Arash Howaida - 4 months ago
R Question

Understanding Graph of Binary Response Regression

Please refer to this image: [image: ROC curve (left) and cost vs. threshold curve (right)]

I believe it is generated using R or SAS or something. I want to make sure I understand what it is depicting and recreate it from scratch.

I understand the left-hand side, the ROC curve, and I have generated my own using my probit model at varying thresholds.

What I do not understand is the right-hand side graph. What is meant by a 'cost' function? What are its units? I assume the x-axis, labeled 'threshold', is the success cutoff threshold I used for the ROC curve. My only guess is that the y-axis is the sum of squared residuals, but if that's the case, I'd have to recompute the residuals at each threshold.

Please explain what the axes are and how one goes about computing them.

--Edit--
For clarity, I don't need a proof or a line of code. Because I use different statistical software, it's much more useful to have someone explain conceptually (with minimal jargon) how to compute the y-axis. That way I can write it in my software's own language.

Thank you

Answer

I will try to make this as clear as possible. The term cost function is used in many contexts and can carry different meanings. When we use it in the context of a regression model, it is natural to think of minimizing the sum of squared residuals.

However, that is not what is happening here. We are still interested in minimizing a function, but this function is not minimized inside a fitting algorithm the way the sum of squared residuals is; it is simply evaluated at each threshold. Let me elaborate on what the second graph means.

As @oshun correctly mentioned, the author of the R-bloggers post (where these graphs come from) wanted a single number to compare the "mistakes" of the classification at different threshold values. To create that measure he did something very intuitive and simple: he counted the false positives and false negatives at each threshold level and combined them. The function he used is:

sum(df$pred >= threshold & df$survived == 0) * cost_of_fp + #false positives
sum(df$pred <  threshold & df$survived == 1) * cost_of_fn   #false negatives

I deliberately split the above across two lines. The first line counts the false positives: prediction >= threshold means the model classified the passenger as survived, but in reality they didn't survive (survived equals 0). The second line does the same for the false negatives: those predicted as not survived who in reality did survive.

That leaves cost_of_fp and cost_of_fn. These are nothing more than weights, set arbitrarily by the user. In the example above the author used cost_of_fp = 1 and cost_of_fn = 3, which simply means that, as far as the cost function is concerned, a false negative is 3 times as costly as a false positive. So in the cost function every false negative is multiplied by 3 before being added to the count of false positives.

To sum up, the y-axis in the graph above is just:

false_positives * weight_fp + false_negatives * weight_fn

for every value of the threshold (which is used to calculate the false_positives and false_negatives).
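Put together, a minimal self-contained R sketch of this computation might look like the following. The toy data frame `df` and the values in its `pred`/`survived` columns are my own assumptions standing in for the author's Titanic data; the column names and weights mirror the snippet above.

```r
# Toy stand-in for the author's data: predicted survival probabilities
# and actual 0/1 outcomes (hypothetical values, for illustration only).
df <- data.frame(
  pred     = c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1),
  survived = c(1,   1,   0,   1,   0,   0)
)
cost_of_fp <- 1  # weight for a false positive
cost_of_fn <- 3  # weight for a false negative

# Evaluate the cost at every threshold on a grid from 0 to 1.
thresholds <- seq(0, 1, by = 0.01)
cost <- sapply(thresholds, function(threshold) {
  sum(df$pred >= threshold & df$survived == 0) * cost_of_fp +  # false positives
  sum(df$pred <  threshold & df$survived == 1) * cost_of_fn    # false negatives
})

# The right-hand graph is then simply:
# plot(thresholds, cost, type = "l", xlab = "threshold", ylab = "cost")
```

In your own software, the same idea is just a loop over thresholds, counting the two kinds of mistakes at each one and taking the weighted sum.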

I hope this is clear now.