Sam Hammamy - 1 year ago 409
Python Question

# numpy : calculate the derivative of the softmax function

I am trying to understand `backpropagation` in a simple 3 layered neural network with `MNIST`.

There is the input layer with `weights` and a `bias`. The labels are `MNIST` so it's a `10` class vector.

The second layer is a `linear transform`. The third layer is the `softmax activation` to get the output as probabilities.

`Backpropagation` calculates the derivative at each step and calls this the gradient.

Previous layers append the `global` or `previous` gradient to the `local gradient`. I am having trouble calculating the `local gradient` of the `softmax`.

Several resources online go through the explanation of the softmax and its derivatives, and even give code samples of the softmax itself:

```python
import numpy as np


def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)
```

The derivative is explained with respect to when `i = j` and when `i != j`. This is a simple code snippet I've come up with and was hoping to verify my understanding:

```python
import numpy as np


def softmax(self, x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)


def forward(self):
    # self.input is a vector of length 10
    # and is the output of
    # (w * x) + b
    self.value = self.softmax(self.input)


def backward(self):
    n = len(self.value)
    # entry [i, j] holds the derivative of value[i] with respect to input[j]
    self.gradient = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                self.gradient[i, j] = self.value[i] * (1 - self.value[i])
            else:
                self.gradient[i, j] = -self.value[i] * self.value[j]
```

Then `self.gradient` is the `local gradient` which is a vector. Is this correct? Is there a better way to write this?
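One way to check the `i = j` and `i != j` cases is to build the full softmax Jacobian, whose entry `[i, j]` is `s[i] * (1 - s[i])` when `i == j` and `-s[i] * s[j]` otherwise. The sketch below is a vectorized version of that idea, not a definitive implementation. The names `softmax_jacobian` and `upstream` are hypothetical, the input values are made up, and the last line only illustrates how a gradient arriving from a later step could be combined with the local one.

```python
import numpy as np


def softmax(x):
    """Compute the softmax of vector x."""
    # subtracting the max leaves the result unchanged and avoids overflow
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)


def softmax_jacobian(x):
    """Return the matrix J with J[i, j] = d softmax(x)[i] / d x[j]."""
    s = softmax(x)
    # diag(s) puts s[i] on the diagonal; outer(s, s) holds s[i] * s[j]
    return np.diag(s) - np.outer(s, s)


x = np.array([1.0, 2.0, 3.0])  # made-up input
J = softmax_jacobian(x)

# hypothetical gradient of the loss with respect to the softmax output
upstream = np.array([0.1, -0.2, 0.3])
# chain rule: gradient of the loss with respect to x
grad_x = J.T @ upstream
```

Under these assumptions the local gradient of the softmax is a square matrix rather than a vector; it only collapses to a vector once it is multiplied by the upstream gradient.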

I am assuming you have a 3-layer NN in which `W1` and `b1` are associated with the linear transformation from the input layer to the hidden layer, and `W2` and `b2` with the linear transformation from the hidden layer to the output layer. `Z1` and `Z2` are the input vectors to the hidden layer and the output layer, and `a1` and `a2` are the outputs of the hidden layer and the output layer. `a2` is your predicted output. `delta3` and `delta2` are the backpropagated errors, and they give the gradients of the loss function with respect to the model parameters.
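The naming above can be illustrated with a minimal sketch of one forward and backward pass. Several things here are assumptions not stated in the text: a `tanh` hidden activation, a cross-entropy loss with a one-hot label `y`, the layer sizes, the random initial values, and the names `dW1`, `db1`, `dW2`, `db2` for the parameter gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes: 784 input pixels, 32 hidden units, 10 classes
n_in, n_hidden, n_out = 784, 32, 10
W1 = rng.normal(scale=0.01, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

x = rng.random(n_in)  # one made-up input vector
y = np.zeros(n_out)
y[3] = 1.0            # made-up one-hot label

# forward pass
Z1 = x @ W1 + b1
a1 = np.tanh(Z1)                 # assumed hidden activation
Z2 = a1 @ W2 + b2
exps = np.exp(Z2 - np.max(Z2))
a2 = exps / np.sum(exps)         # softmax output, the prediction

# backward pass (assumes cross-entropy loss)
delta3 = a2 - y                         # gradient of the loss w.r.t. Z2
dW2 = np.outer(a1, delta3)
db2 = delta3
delta2 = (delta3 @ W2.T) * (1 - a1**2)  # gradient w.r.t. Z1; tanh'(z) = 1 - tanh(z)**2
dW1 = np.outer(x, delta2)
db1 = delta2
```

With softmax followed by cross-entropy, the Jacobian and the loss derivative combine into the simple form `a2 - y`, which is why `delta3` never needs the full Jacobian in this setup.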