As I understand it, tf.train.replica_device_setter() is quite simple in its behavior: it makes a purely local decision to assign a device to each
tf.Variable as it is created, in a round-robin fashion across the parameter server tasks.
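The round-robin placement can be mimicked with a small sketch in plain Python (this is an illustration of the idea, not TensorFlow's actual implementation; the class and method names here are made up):

```python
class RoundRobinSetter:
    """Mimics replica_device_setter: assigns each successive variable
    to the next parameter server task in turn."""

    def __init__(self, num_ps_tasks):
        self.num_ps_tasks = num_ps_tasks
        self._next = 0

    def device_for_next_variable(self):
        device = "/job:ps/task:%d/cpu:0" % self._next
        self._next = (self._next + 1) % self.num_ps_tasks
        return device

setter = RoundRobinSetter(num_ps_tasks=3)
devices = [setter.device_for_next_variable() for _ in range(5)]
# Successive variables cycle through tasks 0, 1, 2, 0, 1.
print(devices)
```

Because the decision depends only on the order in which variables are created, two processes that create variables in the same sequence get the same placements.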
In the distributed version of TensorFlow, each device (e.g.
"/job:ps/task:17/cpu:0") maintains a map from variable names to variables that is shared between all sessions that use this device. This means that when different worker replicas create a session using that device, if they assign the same symbolic variable (having the same
Variable.name property) to the same device, they will see each other's updates.
When you do "between-graph replication" across multiple replicas, the
tf.train.replica_device_setter() provides a simple, deterministic way to assign variables to devices. If you build an identical graph on each worker replica, each variable will be assigned to the same device and successfully shared, without any external coordination.
Caveat: With this scheme, your worker replicas must create an identical graph*, and there must be no randomness in how the graph is constructed. I once saw an issue where the order of creating variables was determined by iterating over the keys of a Python
dict, whose iteration order is not guaranteed to be the same across processes. This led to variables being assigned to different PS devices by different workers.
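A simple way to avoid this pitfall is to always iterate over the keys in a sorted order, so that every worker creates its variables in the same sequence. The dict of layer sizes below is a hypothetical example:

```python
# Hypothetical model configuration keyed by layer name.
layer_sizes = {"hidden2": 64, "output": 10, "hidden1": 128}

# Deterministic: sorting the keys guarantees that every worker
# process creates its variables in exactly the same sequence,
# regardless of how the dict was populated.
creation_order = [name for name in sorted(layer_sizes)]
print(creation_order)
```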
As to your other question, you do need to be careful about variable name clashes when training multiple models using the same processes. By default all variables are shared in a global namespace, so two variables from different networks with the same name will clash. One way to mitigate this problem is to wrap each model in a
with tf.container(name): block (with a different value of name for each model) to put your variables in a different namespace, which is called a "container" in TensorFlow jargon. You can think of a container as a prefix that is added to the names of all of your variables when they are looked up on the device. Support for containers in the API is still quite preliminary, but there are plans to make them more useful in future.
* Technically, they only need to create their
tf.Variable objects in the same sequence.