
Distributed TensorFlow: ValueError: When using replicas, all Variables must have their device set: name: "Variable"

I am trying to write a distributed variational autoencoder in TensorFlow, running in standalone mode.

My cluster consists of 3 machines, named m1, m2 and m3. I am trying to run one ps server on m1 and two worker servers on m2 and m3 (following the example trainer program in the distributed TensorFlow documentation).
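For reference, the cluster layout above (one ps on m1, two workers on m2 and m3) corresponds to a spec like the following. The host:port strings are hypothetical placeholders, and the `tf.train.Server` call is shown only as a comment since it requires the running cluster:

```python
# Hypothetical addresses for the three machines; the ports are examples.
cluster_spec = {
    "ps": ["m1:2222"],                 # parameter server on m1
    "worker": ["m2:2222", "m3:2222"],  # workers on m2 and m3
}

# Each process would start a server for its own job and task, e.g.:
#   server = tf.train.Server(tf.train.ClusterSpec(cluster_spec),
#                            job_name=FLAGS.job_name,
#                            task_index=FLAGS.task_index)

# The worker running on m3 is worker task 1, so its device string is:
m3_worker_device = "/job:worker/task:%d" % 1
```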
On m3 I got the following error message:

Traceback (most recent call last):
  File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 241, in <module>
  File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 334, in __init__
  File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 863, in _verify_setup
    "their device set: %s" % op)
ValueError: When using replicas, all Variables must have their device set: name: "Variable"
op: "Variable"
attr {
  key: "container"
  value {
    s: ""
  }
}
attr {
  key: "dtype"
  value {
    type: DT_INT32
  }
}
attr {
  key: "shape"
  value {
    shape {
    }
  }
}
attr {
  key: "shared_name"
  value {
    s: ""
  }
}

And here is the part of my code that defines the network and the Supervisor:

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":

    # Set the distributed device
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):

        # Build the training computation graph
        x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
        optimizer = tf.train.AdamOptimizer(learning_rate=0.001, epsilon=1e-4)
        with tf.variable_scope("model") as scope:
            with pt.defaults_scope(phase=pt.Phase.train):
                train_model = M1(n_z, x_train.shape[1])
                train_vz_mean, train_vz_logstd = q_net(x, n_z)
                train_variational = ReparameterizedNormal(
                    train_vz_mean, train_vz_logstd)
                grads, lower_bound = advi(
                    train_model, x, train_variational, lb_samples, optimizer)
                infer = optimizer.apply_gradients(grads)

        # Build the evaluation computation graph
        with tf.variable_scope("model", reuse=True) as scope:
            with pt.defaults_scope(phase=pt.Phase.test):
                eval_model = M1(n_z, x_train.shape[1])
                eval_vz_mean, eval_vz_logstd = q_net(x, n_z)
                eval_variational = ReparameterizedNormal(
                    eval_vz_mean, eval_vz_logstd)
                eval_lower_bound = is_loglikelihood(
                    eval_model, x, eval_variational, lb_samples)
                eval_log_likelihood = is_loglikelihood(
                    eval_model, x, eval_variational, ll_samples)

    # saver = tf.train.Saver()
    summary_op = tf.merge_all_summaries()
    global_step = tf.Variable(0)
    init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             # saver=saver,
                             init_op=init_op,
                             summary_op=summary_op,
                             global_step=global_step)
    print("create sv done")

I think there must be something wrong with my code, but I don't know how to fix it. Any advice? Thanks a lot!

Answer

The problem stems from the definition of your global_step variable:

global_step = tf.Variable(0)

This definition is outside the scope of the with tf.device(tf.train.replica_device_setter(...)): block above, so no device is assigned to global_step. In replicated training, this is often a source of error (because if different replicas decide to place the variable on a different device, they won't share the same value), so TensorFlow includes a sanity check that prevents this.
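To see why the placement matters, here is a toy, pure-Python mimic of the policy that tf.train.replica_device_setter applies: variables are assigned to ps tasks in round-robin order, while all other ops stay on the worker. The function and names below are illustrative only, not TensorFlow's actual implementation:

```python
def replica_device_setter_sketch(ps_tasks, worker_device):
    """Mimic replica_device_setter's placement policy: variables go
    to ps tasks in round-robin order; everything else stays on the
    worker. Purely illustrative."""
    state = {"next_ps": 0}

    def device_for(op_type):
        if op_type == "Variable":
            task = state["next_ps"] % ps_tasks
            state["next_ps"] += 1
            return "/job:ps/task:%d" % task
        return worker_device

    return device_for

place = replica_device_setter_sketch(ps_tasks=1,
                                     worker_device="/job:worker/task:0")
devices = [place("Variable"), place("MatMul"), place("Variable")]
```

A variable created outside any such scope gets no device at all, which is exactly the situation the sanity check rejects: each replica would be free to place it differently, and the replicas would then stop sharing its value.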

Fortunately, the solution is simple. You can either define global_step inside the with tf.device(tf.train.replica_device_setter(...)): block above, or add a small with tf.device("/job:ps/task:0"): block as follows:

with tf.device("/job:ps/task:0"):
    global_step = tf.Variable(0, name="global_step")