I am trying to understand the exact roles of the master and worker service in TensorFlow.
So far I understand that each TensorFlow task that I start is associated with a tf.train.Server instance.

1st Question: Am I right that this means that ONE task is only associated with ONE worker?
This is the typical configuration, yes. Each tf.train.Server instance contains a full TensorFlow runtime, and the default configuration assumes that this runtime has exclusive access to the machine (in terms of how much memory it allocates on GPUs, etc.).
Note that you can create multiple tf.train.Server instances in the same process (and we sometimes do this for testing). However, there is little resource isolation between these instances, so running multiple instances in a single task is unlikely to yield good performance (with the current version).
2nd Question: When we execute a graph distributed across more than one task, does only one master service get used?
It depends on the form of replication you are using. If you use "in-graph replication", you can use a single master service that knows about all replicas of the model (i.e. worker tasks). If you use "between-graph replication", you would use multiple master services, each of which knows about a single replica of the model, and is typically colocated with the worker task on which it runs. In general, we have found it more performant to use between-graph replication, and the tf.train.Supervisor library is designed to simplify operation in this mode.
3rd Question: Should tf.Session.run() only be called once?
(I'm assuming this means "once per training step". A simple TensorFlow program for training a model will call tf.Session.run() in a loop.)
This depends on the form of replication you are using, and the coordination you want between training updates.
Using in-graph replication, you can make synchronous updates by aggregating the losses or gradients in a single tf.train.Optimizer, which gives a single train_op to run. In this case, you only call tf.Session.run(train_op) once per training step.
Using in-graph replication, you make asynchronous updates by defining one tf.train.Optimizer per replica, which gives multiple train_op operations to run. In this case, you typically call each tf.Session.run(train_op[i]) from a different thread, concurrently.
Using between-graph replication, you make synchronous updates using the tf.train.SyncReplicasOptimizer, which is constructed separately in each replica. Each replica has its own training loop that makes a single call to tf.Session.run(train_op), and the SyncReplicasOptimizer coordinates these so that the updates are applied synchronously (by a background thread in one of the workers).
Using between-graph replication, you make asynchronous updates by using another tf.train.Optimizer subclass (other than tf.train.SyncReplicasOptimizer), using a training loop that is similar to the synchronous case, but without the background coordination.
4th Question: How does one worker make use of multiple devices? Does a worker decide automatically how to distribute single operations or...?
Each worker runs the same placement algorithm that is used in single-process TensorFlow. Unless instructed otherwise, the placer will put operations on the GPU if one is available (and there is a GPU-accelerated implementation), otherwise it will fall back to the CPU. The tf.train.replica_device_setter() device function can be used to shard variables across tasks that act as "parameter servers". If you have more complex requirements (e.g. multiple GPUs, local variables on the workers, etc.) you can use explicit with tf.device(...): blocks to assign subgraphs to a particular device.