Paul Paul - 2 months ago 7
Python Question

How does "tf.train.replica_device_setter" work?

I understand that tf.train.replica_device_setter can be used to automatically assign each variable to the same parameter server (PS) task every time (using round-robin), and to place the compute-intensive nodes on one worker.

How do the same variables get reused across multiple graph replicas that are built by different workers? Does the parameter server only look at the name of the variable that a worker asks for?

Does this mean that tasks should not be used in parallel to execute two different graphs if the variables in both graphs have the same names?

Answer

tf.train.replica_device_setter() is quite simple in its behavior: it makes a purely local decision to assign a device to each tf.Variable as it is created, in a round-robin fashion across the parameter server tasks.
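To make the round-robin idea concrete, here is a toy model in plain Python. It is only a sketch of the behavior described above, not TensorFlow's implementation; the RoundRobinSetter class, the place_variables helper, and the variable names are all hypothetical.

```python
import itertools


class RoundRobinSetter:
    """Toy stand-in for the placement policy: cycles through the PS tasks."""

    def __init__(self, num_ps_tasks):
        self._tasks = itertools.cycle(range(num_ps_tasks))

    def device_for_next_variable(self):
        # Purely local decision: the result depends only on how many
        # variables this process has created so far.
        return "/job:ps/task:%d" % next(self._tasks)


def place_variables(names, num_ps_tasks):
    """Assign each variable name, in creation order, to a PS task."""
    setter = RoundRobinSetter(num_ps_tasks)
    return {name: setter.device_for_next_variable() for name in names}


placement = place_variables(["w1", "b1", "w2", "b2"], num_ps_tasks=2)
for name, device in placement.items():
    print(name, "->", device)
```

With two PS tasks, w1 and w2 land on task 0 while b1 and b2 land on task 1. Nothing in the decision involves talking to another process.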

In the distributed version of TensorFlow, each device (e.g. "/job:ps/task:17/cpu:0") maintains a map from variable names to variables that is shared between all sessions that use this device. This means that when different worker replicas create a session using that device, if they assign the same symbolic variable (having the same Variable.name property) to the same device, they will see each other's updates.
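The sharing described above can be pictured with another plain-Python toy. ToyDevice and lookup_or_create are hypothetical names used only for illustration; the real lookup happens inside the TensorFlow runtime, and the one-element list merely stands in for a variable's storage.

```python
class ToyDevice:
    """Toy model of a PS device that keeps a name -> variable map."""

    def __init__(self, name):
        self.name = name
        self._variables = {}

    def lookup_or_create(self, var_name, initial_value):
        # The first session to ask for a name creates the storage;
        # later sessions asking for the same name get the same storage.
        return self._variables.setdefault(var_name, [initial_value])


ps_device = ToyDevice("/job:ps/task:17/cpu:0")

# Two worker replicas refer to the same variable name on the same device.
w_from_worker_a = ps_device.lookup_or_create("weights:0", 0.0)
w_from_worker_b = ps_device.lookup_or_create("weights:0", 0.0)

w_from_worker_a[0] += 1.5    # worker A applies an update
print(w_from_worker_b[0])    # worker B sees it: 1.5
```

Because the lookup is keyed only by name, the two workers end up with the same storage even though neither knows about the other.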

When you do "between-graph replication" across multiple replicas, the tf.train.replica_device_setter() provides a simple, deterministic way to assign variables to devices. If you build an identical graph on each worker replica, each variable will be assigned to the same device and successfully shared, without any external coordination.

Caveat: With this scheme, your worker replicas must create an identical graph*, and there must be no randomness in how the graph is constructed. I once saw an issue where the order of creating variables was determined by iterating over the keys of a Python dict, which is not guaranteed to happen in the same order across processes. This led to variables being assigned to different PS devices by different workers....
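The ordering problem can be reproduced with a small plain-Python sketch. The round_robin_placement helper and the layer names are hypothetical, and the placement rule is a simplified model of the round-robin behavior rather than TensorFlow's code.

```python
def round_robin_placement(names, num_ps_tasks):
    """Variable number i (in creation order) goes to PS task i mod num_ps_tasks."""
    return {
        name: "/job:ps/task:%d" % (i % num_ps_tasks)
        for i, name in enumerate(names)
    }


# Same variables, but the two workers happen to create them in different orders.
worker_a_order = ["conv1", "conv2", "fc"]
worker_b_order = ["conv2", "conv1", "fc"]

placement_a = round_robin_placement(worker_a_order, num_ps_tasks=2)
placement_b = round_robin_placement(worker_b_order, num_ps_tasks=2)

mismatched = sorted(n for n in placement_a if placement_a[n] != placement_b[n])
print(mismatched)  # ['conv1', 'conv2']

# One possible fix: impose a deterministic order (here, sorted names)
# before creating any variables.
fixed_a = round_robin_placement(sorted(worker_a_order), num_ps_tasks=2)
fixed_b = round_robin_placement(sorted(worker_b_order), num_ps_tasks=2)
print(fixed_a == fixed_b)  # True
```

Sorting is just one way to get a stable order; anything that yields the same creation sequence in every process would do.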

As to your other question, you do need to be careful about variable name clashes when training multiple models using the same processes. By default, all variables are shared in a global namespace, so two variables from different networks with the same name will clash. One way to mitigate this problem is to wrap each model in a with tf.container(name): block (with different values for name, e.g. "model_1" and "model_2") to put your variables in a different namespace, which is called a "container" in TensorFlow jargon. You can think of a container as a prefix that is added to the name of all of your variables when they are looked up on the device. The support for containers in the API is still quite preliminary, but there are plans to make them more useful in the future.
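The "container as a prefix" mental model can be sketched in plain Python as well. ToyDevice and its lookup_or_create method are hypothetical, and the (container, name) key is only an assumption about how to picture the lookup, not a description of TensorFlow internals.

```python
class ToyDevice:
    """Toy model of a device whose variable map is keyed by container and name."""

    def __init__(self):
        self._variables = {}

    def lookup_or_create(self, container, var_name, initial_value):
        # The container acts like a prefix on the variable name.
        key = (container, var_name)
        return self._variables.setdefault(key, [initial_value])


device = ToyDevice()

# Default (empty) container: two unrelated models both use "weights:0" and clash.
net_a = device.lookup_or_create("", "weights:0", 0.0)
net_b = device.lookup_or_create("", "weights:0", 0.0)
print(net_a is net_b)  # True

# Separate containers: the same variable name now maps to separate storage.
model_1 = device.lookup_or_create("model_1", "weights:0", 0.0)
model_2 = device.lookup_or_create("model_2", "weights:0", 0.0)
model_1[0] = 3.0
print(model_2[0])  # 0.0
```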


 * Technically, they only need to create their tf.Variable objects in the same sequence.
