Recently I've been toying with TensorFlow, and I noticed that the framework does not use all of my available computational resources. The Convolutional Neural Networks tutorial mentions that:
Naively employing asynchronous updates of model parameters leads to sub-optimal training performance because an individual model replica might be trained on a stale copy of the model parameters. Conversely, employing fully synchronous updates will be as slow as the slowest model replica.
Asynchronous gradient descent is supported in the open-source release of TensorFlow, without even modifying your graph. The easiest way to do it is to execute multiple concurrent steps in parallel:

    import threading

    import tensorflow as tf

    loss = ...

    # Any of the optimizer classes can be used here.
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    sess = tf.Session()
    sess.run(tf.initialize_all_variables())

    def train_function():
        # TODO: Better termination condition, e.g. using a `max_steps` counter.
        while True:
            sess.run(train_op)

    # Create multiple threads to run `train_function()` in parallel.
    train_threads = []
    for _ in range(NUM_CONCURRENT_STEPS):
        train_threads.append(threading.Thread(target=train_function))

    # Start the threads, and block on their completion.
    for t in train_threads:
        t.start()
    for t in train_threads:
        t.join()

This example sets up `NUM_CONCURRENT_STEPS` calls to `sess.run(train_op)`. Since there is no coordination between these threads, they proceed asynchronously.
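The TODO in the snippet above suggests a `max_steps` counter as a termination condition. Here is a minimal sketch of one way that might look, using only the standard library so it runs without TensorFlow. The one-parameter "model", the `claim_step()` helper, and all constants are hypothetical stand-ins; in real code the body of the loop would be `sess.run(train_op)`.

```python
import threading

# Hypothetical stand-in for a model: one parameter w minimizing (w - TARGET)^2.
TARGET = 3.0
LEARNING_RATE = 0.1
MAX_STEPS = 200
NUM_CONCURRENT_STEPS = 4

params = {"w": 0.0}  # shared state, playing the role of the session's variables
step_lock = threading.Lock()
global_step = 0


def claim_step():
    """Atomically reserve the next step; return False once MAX_STEPS is reached."""
    global global_step
    with step_lock:
        if global_step >= MAX_STEPS:
            return False
        global_step += 1
        return True


def train_function():
    # Each iteration stands in for one `sess.run(train_op)` call.
    while claim_step():
        w = params["w"]              # this read may be stale by the time we write
        grad = 2.0 * (w - TARGET)    # derivative of (w - TARGET)^2
        params["w"] = w - LEARNING_RATE * grad  # uncoordinated write


train_threads = [threading.Thread(target=train_function)
                 for _ in range(NUM_CONCURRENT_STEPS)]
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()

print(global_step, params["w"])
```

Only the step counter is protected by the lock. The parameter updates themselves stay uncoordinated, so the threads still proceed asynchronously but stop after exactly `MAX_STEPS` steps in total.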
It's actually more challenging to achieve synchronous parallel training (at present), because this requires additional coordination to ensure that all replicas read the same version of the parameters, and that all of their updates become visible at the same time. The multi-GPU example for CIFAR-10 training performs synchronous updates by making multiple copies of the "tower" in the training graph with shared parameters, and explicitly averaging the gradients across the towers before applying the update.
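To make the synchronous scheme concrete, here is a toy sketch of gradient averaging across towers. It is plain Python and is not the CIFAR-10 code: the scalar model, the data shards, and the function names are all assumptions chosen for illustration, and the towers run one after another rather than on separate devices.

```python
# Hypothetical toy problem: fit y = w * x by least squares with one scalar w.
LEARNING_RATE = 0.05

# Each "tower" gets its own equal-sized shard of the batch as (x, y) pairs.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],  # tower 0
    [(3.0, 6.0), (4.0, 8.0)],  # tower 1
]


def tower_gradient(w, shard):
    """Mean gradient of (w * x - y)^2 over one tower's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(w):
    # Every tower reads the same version of the shared parameter ...
    grads = [tower_gradient(w, shard) for shard in shards]
    # ... the gradients are explicitly averaged across the towers ...
    mean_grad = sum(grads) / len(grads)
    # ... and one update becomes visible to all towers at the same time.
    return w - LEARNING_RATE * mean_grad


w = 0.0
for _ in range(100):
    w = synchronous_step(w)

print(w)
```

Because the shards have equal size, the averaged gradient equals the gradient over the full batch, so each step behaves like a single large-batch update. The cost is that a step cannot finish until the slowest tower has produced its gradient.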
N.B. The code in this answer places all computation on the same device, which will not be optimal if you have multiple GPUs in your machine. If you want to use all of your GPUs, follow the example of the multi-GPU CIFAR-10 model, and create multiple "towers" with their operations pinned to each GPU. The code would look roughly as follows:

    train_ops = []
    for i in range(NUM_GPUS):
        with tf.device("/gpu:%d" % i):
            # Define a tower on GPU `i`.
            loss = ...
            train_ops.append(tf.train.GradientDescentOptimizer(0.01).minimize(loss))

    # `sess` is assumed to be created and initialized as in the first example.
    def train_function(train_op):
        # TODO: Better termination condition, e.g. using a `max_steps` counter.
        while True:
            sess.run(train_op)

    # Create multiple threads to run `train_function()` in parallel.
    train_threads = []
    for train_op in train_ops:
        train_threads.append(threading.Thread(target=train_function, args=(train_op,)))

    # Start the threads, and block on their completion.
    for t in train_threads:
        t.start()
    for t in train_threads:
        t.join()

Note that you might find it convenient to use a "variable scope" to facilitate variable sharing between the towers.