Paul Paul - 4 months ago 114
Python Question

TensorFlow Master and Worker Service

I am trying to understand the exact roles of the master and worker service in TensorFlow.

So far I understand that each TensorFlow task that I start is associated with a

tf.train.Server
instance. This instance exports a "master service" and "worker service" by implementing the
tensorflow::Session
interface" (master) and
worker_service.proto
(worker).

1st Question: Am I right that this means, that ONE task is only associated with ONE worker?




Furthermore, I understood...

...about the Master:
It is the scope of the master service...

(1) ...to offer functions to the client so that the client can run a session for example.

(2) ...to delegate work to the available workers in order to compute a session run.

2nd Question:
In case we execute a graph distributed using more than one task, does only one master service get used?


3rd Question:
Should tf.Session.run get only called once?


This is at least how I interpret this figure from the whitepaper:

enter image description here




... about the Worker:
It is the scope of the worker service...

(1) to execute the nodes (which got delegated to him by the master service) on the devices the worker manages.

4th Question:
How does one worker make use of multiple devices? Does a worker decide automatically how to distribute single operations?





Please also correct me in case I came up with wrong statements!
Thank you in advance!!

Answer

1st Question: Am I right that this means, that ONE task is only associated with ONE worker?

This is the typical configuration, yes. Each tf.train.Server instance contains a full TensorFlow runtime, and the default configuration assumes that this runtime has exclusive access to the machine (in terms of how much memory it allocates on GPUs, etc.).

Note that you can create multiple tf.train.Server instances in the same process (and we sometimes do this for testing). However, there is little resource isolation between these instances, so running multiple instances in a single task is unlikely to yield good performance (with the current version).

2nd Question: In case we execute a graph distributed using more than one task, does only one master service get used?

It depends on the form of replication you are using. If you use "in-graph replication", you can use a single master service that knows about all replicas of the model (i.e. worker tasks). If you use "between-graph replication", you would use multiple master services, each of which knows about a single replica of the model, and is typically colocated with the worker task on which it runs. In general, we have found it more performant to use between-graph replication, and the tf.train.Supervisor library is designed to simplify operation in this mode.

3rd Question: Should tf.Session.run() get only called once?

(I'm assuming this means "once per training step". A simple TensorFlow program for training a model will call tf.Session.run() in a loop.)

This depends on the form of replication you are using, and the coordination you want between training updates.

  • Using in-graph replication, you can make synchronous updates by aggregating the losses or gradients in a single tf.train.Optimizer, which gives a single train_op to run. In this case, you only call tf.Session.run(train_op) once per training step.

  • Using in-graph replication, you make asynchronous updates by defining one tf.train.Optimizer per replica, which gives multiple train_op operations to run. In this case, you typically call each tf.Session.run(train_op[i]) from a different thread, concurrently.

  • Using between-graph replication, you make synchronous updates using the tf.train.SyncReplicasOptimizer, which is constructed separately in each replica. Each replica has its own training loop that makes a single call to tf.Session.run(train_op), and the SyncReplicasOptimizer coordinates these so that the updates are applied synchronously (by a background thread in one of the workers).

  • Using between-graph replication, you make asynchronous updates by using another tf.train.Optimizer subclass (other than tf.train.SyncReplicasOptimizer), using a training loop that similar to the synchronous case, but without the background coordination.

4th Question: How does one worker make use of multiple devices? Does a worker decide automatically how to distribute single operations or...?

Each worker runs the same placement algorithm that is used in single-process TensorFlow. Unless instructed otherwise, the placer will put operations on the GPU if one is available (and there is a GPU-accelerated implementation), otherwise it will fall back to the CPU. The tf.train.replica_device_setter() device function can be used to shard variables across tasks that act as "parameter servers". If you have more complex requirements (e.g. multiple GPUs, local variables on the workers, etc.) you can use explicit with tf.device(...): blocks to assign subgraphs to a particular device.

Comments