marcman - 1 year ago 217
C++ Question

Caffe SigmoidCrossEntropyLoss Layer Loss Function

I was looking through the code of Caffe's SigmoidCrossEntropyLoss layer and the docs, and I'm a bit confused. The docs list the loss function as the logit loss (I'd replicate it here, but without LaTeX the formula would be difficult to read; check out the docs link, it's at the very top).

However, the code itself (`Forward_cpu(...)`) shows a different formula:

```cpp
Dtype loss = 0;
for (int i = 0; i < count; ++i) {
  loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
      log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
}
top[0]->mutable_cpu_data()[0] = loss / num;
```

Is it because this is accounting for the sigmoid function having already been applied to the input?

However, even so, the `(input_data[i] >= 0)` snippets are confusing me as well. They appear to be in place of the p_hat from the loss formula in the docs, which is supposed to be the prediction squashed by the sigmoid function. So why are they just taking a binary threshold? It's made even more confusing because this loss predicts [0, 1] outputs, so `(input_data[i] >= 0)` will be a `1` unless it's 100% sure it's not.

Can someone please explain this to me?

The `SigmoidCrossEntropy` layer in Caffe combines two steps (`Sigmoid` + `CrossEntropy`), both applied to `input_data`, into one piece of code:

```cpp
Dtype loss = 0;
for (int i = 0; i < count; ++i) {
  loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
      log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
}
top[0]->mutable_cpu_data()[0] = loss / num;
```

In fact, whether or not `input_data >= 0`, the code above is always mathematically equivalent to the following:

```cpp
Dtype loss = 0;
for (int i = 0; i < count; ++i) {
  loss -= input_data[i] * (target[i] - 1) -
      log(1 + exp(-input_data[i]));
}
top[0]->mutable_cpu_data()[0] = loss / num;
```

This code follows from the straightforward math formula you get by applying `Sigmoid` and then `CrossEntropy` to `input_data` and simplifying the resulting expression.

But the first piece of code (the one Caffe uses) is more numerically stable and runs less risk of overflow, because it avoids evaluating a huge `exp(input_data)` (or `exp(-input_data)`) when the absolute value of `input_data` is large. That's why you see that code in Caffe.
