marcman - 1 month ago
C++ Question

Caffe SigmoidCrossEntropyLoss Layer Loss Function

I was looking through the code of Caffe's SigmoidCrossEntropyLoss layer and the docs, and I'm a bit confused. The docs list the loss function as the logit loss (I'd replicate it here, but without LaTeX the formula would be difficult to read; check out the docs link, it's at the very top).

However, the code itself (Forward_cpu(...)) shows a different formula:

Dtype loss = 0;
for (int i = 0; i < count; ++i) {
    loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
        log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
}
top[0]->mutable_cpu_data()[0] = loss / num;


Is it because this is accounting for the sigmoid function having already been applied to the input?

However, even so, the (input_data[i] >= 0) snippets are confusing me as well. They appear to be in place of the p_hat from the loss formula in the docs, which is supposed to be the prediction squashed by the sigmoid function. So why is the code just taking a binary threshold? It's made even more confusing because this loss predicts [0,1] outputs, so (input_data[i] >= 0) will be a 1 unless the prediction is 100% sure it's not.

Can someone please explain this to me?

Answer

The SigmoidCrossEntropy layer in Caffe combines two steps (Sigmoid + CrossEntropy), applied to input_data, into one piece of code:

Dtype loss = 0;
for (int i = 0; i < count; ++i) {
    loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
        log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
}
top[0]->mutable_cpu_data()[0] = loss / num;

In fact, whether input_data >= 0 or not, the above code is always mathematically equivalent to the following:

Dtype loss = 0;
for (int i = 0; i < count; ++i) {
    loss -= input_data[i] * (target[i] - 1) -
        log(1 + exp(-input_data[i]));
}
top[0]->mutable_cpu_data()[0] = loss / num;

This code follows directly from the formula you get after applying Sigmoid and then CrossEntropy to input_data and simplifying the result.
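For reference, the simplification can be sketched as follows (writing x for input_data[i], t for target[i], and sigma for the sigmoid):

```latex
% Sigmoid cross-entropy for one element, x = input_data[i], t = target[i]
\ell = -\left[ t \log \sigma(x) + (1 - t) \log\bigl(1 - \sigma(x)\bigr) \right]
% With sigma(x) = 1 / (1 + e^{-x}):
%   log sigma(x)      = -log(1 + e^{-x})
%   log(1 - sigma(x)) = -x - log(1 + e^{-x})
\ell = -\left[ x(t - 1) - \log\bigl(1 + e^{-x}\bigr) \right]
% Using log(1 + e^{-x}) = log(1 + e^{x}) - x, the same quantity is also
\ell = -\left[ x t - \log\bigl(1 + e^{x}\bigr) \right]
% Caffe's indicator (x >= 0) simply selects whichever of these two equal
% forms keeps the argument of exp non-positive.
```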

But the first piece of code (the one Caffe uses) is more numerically stable and runs less risk of overflow, because it avoids computing exp(input_data) (or exp(-input_data)) with a large positive argument when the absolute value of input_data is large: the indicator guarantees the argument of exp is always non-positive. That's why you see that code in Caffe.