
How to avoid Theano gradient computations going toward NaN

I'm training a CNN with 6 convolution layers, 2 hidden layers, and 1 softmax layer.

The architecture is:

cv0->relu->cv1->relu->cv2->relu->cv3->relu->cv4->relu->cv5->hid1->relu->hid2->relu->logistic softmax


Training uses stochastic gradient descent on 66 patches taken from an image; for testing purposes, training was run on a single image only, for 20 epochs.

What I observe is that the error explodes at each iteration, so the gradient becomes NaN after the 3rd or 4th epoch:


  • epoch 1 learning cost: 4.702012
  • epoch 2 learning cost: 45338036.000000
  • epoch 3 learning cost: 74726722389225987403008805175296.000000
  • epoch 4 learning cost: nan



As you can see, after the error exploded to a very large value, the gradient produced NaN, which was then propagated through the whole network.

Looking at the weight values of a single node from different layers to see what happened:

layer8 (softmax):

  • Initial value: [ 0.05436778 0.02379715]
  • epoch 1: [ 0.28402206 -0.20585714]
  • epoch 2: [ -5.27361184e-02 9.52038541e-02]
  • epoch 3: [-7330.04199219 7330.12011719]
  • epoch 4: [ nan nan]



layer6 (hid1):

  • Initial value: [-0.0254469 0.00760095 ..., -0.00587915 0.02619855 0.03809309]
  • epoch 1: [-0.0254469 0.00760095 ..., -0.00587915 0.02619855 0.03809309]
  • epoch 2: [-0.0254469 0.00760095 ..., -0.00587915 0.02619855 0.03809309]
  • epoch 3: [ -2.54468974e-02 1.79247314e+16 ..., -5.87915350e-03 2.61985492e-02 -2.06307964e+19]
  • epoch 4: [ nan nan ..., nan nan nan]



layer0 (cv0):

On initialization it is

[[-0.01704694 -0.01683052 -0.0894756 ]
[ 0.12275343 -0.05518051 -0.09202443]
[-0.11599202 -0.04718829 -0.04359322]]


while at the 3rd epoch it is

[[-24165.15234375 -26490.89257812 -24820.1484375 ]
[-27381.8203125 -26653.3359375 -24762.28710938]
[-23120.56835938 -21189.44921875 -24513.65039062]]


It is clear that the weight values are exploding.

The learning rate is 0.01, so to solve this issue I changed it to 0.001; sometimes the NaN disappears and the network converges, and sometimes it does not and the network saturates with NaN. I then tried an even smaller learning rate of 0.0001 and haven't seen NaN yet. What I also see is that every time I re-run the code the results are quite different, which I think is related in the first place to the weight initialization.
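To catch the divergence earlier than from the cost values, a simple finiteness check on the parameters after each epoch helps; a minimal sketch, assuming classifier.params is the list of Theano shared parameter variables:

import numpy

def params_are_finite(params):
    # return False as soon as any parameter contains NaN or Inf
    for p in params:
        if not numpy.isfinite(p.get_value(borrow=True)).all():
            return False
    return True

# inside the training loop, after each epoch:
# if not params_are_finite(classifier.params):
#     print('parameters diverged (NaN/Inf), stop or lower the learning rate')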

So I tried different weight initializations:

For the conv layers with ReLU:

# scaling factors: sqrt(6/(fan_in+fan_out)) (Glorot uniform bound) and sqrt(2/(fan_in+fan_out))
W_bound_6 = numpy.sqrt(6. / (fan_in + fan_out))
W_bound_2 = numpy.sqrt(2. / (fan_in + fan_out))
# Gaussian init for the 4D convolution filters, scaled by W_bound_2
W_values = numpy.asarray(
    numpy.random.randn(filter_shape[0], filter_shape[1],
                       filter_shape[2], filter_shape[3]) * W_bound_2,
    dtype=theano.config.floatX
)


and for the hidden layers and the softmax layer:

# same scheme for the 2D weight matrices of the hidden/softmax layers
W_bound_2 = numpy.sqrt(2. / (filter_shape[0] + filter_shape[1]))
W_values = numpy.asarray(
    numpy.random.randn(filter_shape[0], filter_shape[1]) * W_bound_2,
    dtype=theano.config.floatX
)


and initializing all the biases b to zero.

The difference is not that big, and I still don't see a difference in the results.
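For ReLU layers, a commonly recommended alternative is He initialization, which scales by sqrt(2 / fan_in) rather than sqrt(2 / (fan_in + fan_out)). A minimal sketch, assuming filter_shape follows the usual Theano convention (num_filters, num_input_maps, filter_height, filter_width):

import numpy
import theano

# He-style initialization for a ReLU conv layer (sketch, not the code above)
fan_in = filter_shape[1] * filter_shape[2] * filter_shape[3]  # inputs feeding one output unit
W_values = numpy.asarray(
    numpy.random.randn(*filter_shape) * numpy.sqrt(2. / fan_in),
    dtype=theano.config.floatX
)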

I'm posting my question here to:


  • Discuss whether what I'm doing regarding the weight initialization is correct in the code.

  • See whether I can avoid making the learning rate very small and keep it high, at least for the first few iterations, because in my case NaN was already propagating by the 4th epoch.

  • Ask whether L1/L2 regularization in Theano should be implemented in the cost function or by changing the update function (see the sketch after the cost function below).



Cost function

-T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
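Regarding the L1/L2 question above: in Theano the penalty is normally added directly to this cost expression, so that T.grad differentiates it together with the data term and the plain SGD update below stays unchanged. A minimal sketch, where the lambda values and the weight_matrices list are assumptions:

import theano.tensor as T

l1_lambda = 0.0001  # hypothetical regularization strengths
l2_lambda = 0.0001

nll = -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])

# weight_matrices: list of the W shared variables (biases are usually left out)
l1_penalty = sum(T.abs_(W).sum() for W in weight_matrices)
l2_penalty = sum((W ** 2).sum() for W in weight_matrices)

# T.grad of this cost handles the penalty terms automatically
cost = nll + l1_lambda * l1_penalty + l2_lambda * l2_penalty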


Update function

updates = [
    (param_i, param_i - learning_rate * grad_i)
    for param_i, grad_i in zip(classifier.params, grads)
]
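Gradient clipping (one of the fixes listed in the answer below) would plug in at this same point, by clipping each gradient before it enters the update rule; a minimal sketch, where the bound of 1.0 is an arbitrary choice:

import theano.tensor as T

# clip each gradient element into [-1, 1] before the SGD step (bound chosen arbitrarily)
clipped_grads = [T.clip(grad_i, -1.0, 1.0) for grad_i in grads]
updates = [
    (param_i, param_i - learning_rate * grad_i)
    for param_i, grad_i in zip(classifier.params, clipped_grads)
]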



  • Is it correct to apply ReLU after each layer in my structure, but not after the softmax?
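For reference, a ReLU in Theano is typically just the element-wise maximum with zero, applied to each conv/hidden layer's pre-activation but not after the softmax; a minimal sketch, where hid1_in, W_hid1 and b_hid1 are illustrative names:

import theano.tensor as T

def relu(x):
    # element-wise ReLU; newer Theano versions also provide T.nnet.relu
    return T.maximum(0., x)

# e.g. a hidden layer's output:
# hid1_out = relu(T.dot(hid1_in, W_hid1) + b_hid1)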


Answer

I was looking into different ways to avoid this problem. After searching for formal solutions proposed by others and reading some theoretical background, I'll write my answer here to help others having the same issue.

The reason behind this problem is the combination of softmax and cross-entropy: when the gradient computation divides by zero or inf, you get NaN, which then propagates backwards through all the network parameters.

A few hints for diagnosing the problem:

  • If the error starts increasing and NaN appears afterwards: the network is diverging because the learning rate is too high.
  • If NaNs appear suddenly: saturating units are yielding a non-differentiable gradient, or NaN comes from computing log(0).
  • NaN can also come from floating-point issues (too-high weights) or from activations on the output: 0/0, inf/inf, inf*weight, ...

Solutions:

  1. Reduce the learning rate.
  2. Change the weight initialization.
  3. Use L2 regularization.
  4. Use a safe softmax (add a small value inside log(x)); see the sketch below.
  5. Use gradient clipping.
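For point 4, a minimal sketch of a numerically safer negative log-likelihood, clipping the predicted probabilities away from zero before taking the log (the 1e-7 bound is an arbitrary choice):

import theano.tensor as T

eps = 1e-7
# keep probabilities strictly inside (0, 1) so log() never sees an exact zero
safe_p = T.clip(self.p_y_given_x, eps, 1.0 - eps)
cost = -T.mean(T.log(safe_p)[T.arange(y.shape[0]), y])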

In my case, reducing the learning rate solved the issue, but I'm still working on optimizing it further.
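One way to keep the learning rate relatively high at the beginning and still avoid divergence later (the second point raised in the question) is to decay it between epochs via a shared variable; a minimal sketch, where the initial value 0.01 and the decay factor 0.95 are assumptions:

import numpy
import theano

# learning rate as a shared variable so it can be changed between epochs
learning_rate = theano.shared(numpy.float32(0.01))

updates = [
    (param_i, param_i - learning_rate * grad_i)
    for param_i, grad_i in zip(classifier.params, grads)
]

# after each epoch, shrink the learning rate (0.95 is an arbitrary decay factor)
learning_rate.set_value(numpy.float32(learning_rate.get_value() * 0.95))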