Python Question

How many session objects are needed for synchronous in-graph replication?

When using synchronous in-graph replication, I only call sess.run() on the training op from a single client process.

Question 1: Do I still have to create a new session object for each worker, and do I have to pass the URL of the master server (the one that calls sess.run()) as the session target?

Question 2: Can I get the session target for each server by using server.target, or do I have to specify the URL of the master server explicitly?


If you are using "in-graph replication", the graph contains multiple copies of the computational nodes, typically with one copy per device (i.e. one per worker task if you're doing distributed CPU training, or one per GPU if you're doing distributed or local multi-GPU training). Since all of the replicas are in the same graph, you only need one tf.Session to control the entire training process. You don't need to create tf.Session objects in the workers that don't call

For in-graph training, it's typical to have a single master that is separate from the worker tasks (for performance isolation), but you could colocate it with your client program. In that case, you could simply create a single-task job called "client" and, in that task, create a session using that server's target (server.target). The following example shows how you could write a single script for your "client", "worker", and "ps" jobs:

server = tf.train.Server({"client": ["client_host:2222"],
                          "worker": ["worker_host0:2222", ...],
                          "ps": ["ps_host0:2222", ...]})

if job_name == "ps" or job_name == "worker":

elif job_name == "client":
    # Build a replicated graph.
    # ...

    # A single session, connected to this task's in-process server,
    # drives training across all replicas.
    sess = tf.Session(server.target)

    # Insert training loop here.
    # ...
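The job_name and task_index used above would typically be supplied on the command line. Here is a minimal sketch of that plumbing; the flag names and launch commands are assumptions, not part of the original answer.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--job_name", choices=["client", "worker", "ps"], required=True,
                    help="Which job this process runs.")
parser.add_argument("--task_index", type=int, default=0,
                    help="Index of this task within its job.")
args = parser.parse_args()

job_name, task_index = args.job_name, args.task_index

# Each host then runs the same script, e.g.:
#   python train.py --job_name=ps     --task_index=0
#   python train.py --job_name=worker --task_index=0
#   python train.py --job_name=client --task_index=0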