neurozen - 2 months ago
Python Question

Debugging Tensorflow hang on global variables initialisation

I'm after advice on how to debug what TensorFlow is struggling with when it hangs.

I have a multi-layer CNN which hangs when global_variables_initializer() is run in the session. I get no errors or messages on the console output.

Is there an intelligent way of debugging what TensorFlow is struggling with when it hangs, instead of repeatedly commenting out the lines of code that build the graph and re-running to see where it hangs? Would the TensorFlow debugger (tfdbg) help? What options do I have?

Ideally it would be great to just break the current execution and look at a stack trace or similar to see where execution is hanging during the init.

I'm currently running TensorFlow 0.12.1 with Python 3 inside a Jupyter notebook.

Answer Source

I managed to solve the problem. The tip from @amo-ej1 to run in a regular file was a step in the correct direction. This uncovered that the TensorFlow process was being killed with a SIGKILL and returning an error code of 137.
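For context, exit code 137 is the shell's way of reporting a SIGKILL: a process killed by a signal exits with status 128 + the signal number, and SIGKILL is signal 9. A minimal sketch (POSIX-only, since SIGKILL does not exist on Windows):

```python
import signal
import subprocess
import sys

# Spawn a child that would sleep for a minute, then kill it with SIGKILL,
# much as the kernel would when terminating a runaway process.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
proc.send_signal(signal.SIGKILL)
proc.wait()

# Popen encodes death-by-signal as a negative return code.
print(-proc.returncode)      # 9
# A shell reports the same death as 128 + signal number.
print(128 + signal.SIGKILL)  # 137
```

So a bare 137 with no traceback usually means the process was killed from outside (commonly by the OOM killer), not that Python raised an error.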

I tried the TensorFlow Debugger (tfdbg), though this did not provide any further details, as the problem was that the graph did not initialize. I started to suspect the graph structure was incorrect, so I dumped it out using: tf.summary.FileWriter('./logs/traing_graph', graph)

I then used TensorBoard to inspect the resulting graph summary data dumped to that directory, and found that the tensor dimensions of the fully connected layer were wrong: it had a width of 15 million!

It turned out that one of the configurable parameters of the graph was incorrect: it was picking up the layer 2 tensor dimension from the wrong element of the previous layer's tf.shape, which exploded the dimensions of the graph.
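To illustrate the kind of bug involved (the shapes below are hypothetical, not the ones from my network): if the fully connected layer's width is computed from the wrong element of the previous layer's shape, the batch dimension leaks into the weight matrix and the layer width explodes.

```python
# Hypothetical conv-layer output shape: [batch, height, width, channels]
conv_out_shape = [64, 28, 28, 32]

# Correct FC input width: flatten everything except the batch dimension.
_, h, w, c = conv_out_shape
fc_width = h * w * c
print(fc_width)        # 25088

# Buggy version: indexing from element 0 drags the batch size in,
# multiplying the layer width by 64.
fc_width_buggy = conv_out_shape[0] * h * w * c
print(fc_width_buggy)  # 1605632
```

A mistake like this is invisible in the code but obvious in TensorBoard, where the oversized tensor shows up directly on the graph edge.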

There were no OOM error messages in /var/log/system.log, so I am unsure why graph initialisation caused the Python TensorFlow process to die.

I fixed the dimensions of the graph and graph initialization worked just fine!

My top tip is to visualise your graph with TensorBoard before initialisation and training, to quickly check that the graph structure you coded is what you expected it to be. You will probably save yourself a lot of time! :-)