martin-gorner / tensorflow-mnist-tutorial

Sample code for "Tensorflow and deep learning, without a PhD" presentation and code lab.

InvalidArgumentError when running on cluster

kakawait opened this issue

The demo works when running locally, but it fails when I try to execute it remotely (to take advantage of the GPU) by pointing the session at the server:

tf.Session("grpc://HOSTNAME:2222")
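
(For reference, the remote end here is a TensorFlow gRPC server. A minimal sketch of how such a server could be started on the GPU machine, assuming TensorFlow 1.x; HOSTNAME and the port are placeholders matching the session target above:

import tensorflow as tf

# Single-machine "cluster" with one worker task serving gRPC sessions.
cluster = tf.train.ClusterSpec({"worker": ["HOSTNAME:2222"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # block forever and serve incoming sessions
)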

Running mnist_2.0_five_layers_sigmoid.py against that target produces the following error:

Caused by op 'Variable_1/Assign', defined at:
  File "mnist_2.0_five_layers_sigmoid.py", line 51, in <module>
    B1 = tf.Variable(tf.zeros([L]))
  File "/Volumes/Users/<USERNAME>/.tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 226, in __init__
    expected_shape=expected_shape)
  File "/Volumes/Users/<USERNAME>/.tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 334, in _init_from_args
    validate_shape=validate_shape).op
  File "/Volumes/Users/<USERNAME>/.tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
    use_locking=use_locking, name=name)
  File "/Volumes/Users/<USERNAME>/.tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/Volumes/Users/<USERNAME>/.tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/Volumes/Users/<USERNAME>/.tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [10] rhs shape= [200]
	 [[Node: Variable_1/Assign = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:worker/replica:0/task:0/gpu:0"](Variable_1, zeros)]]

Update: I fixed the InvalidArgumentError for mnist_1.0_softmax.py by upgrading the server's Python version, switching the Docker image from tensorflow/tensorflow:latest-gpu to tensorflow/tensorflow:latest-gpu-py3.

Hmm, never mind, this one is about the cluster. When I restart the server, everything is OK. It just seems that the cluster does not work well with multiple training scripts...
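
In case it helps anyone: the server apparently keeps variables from the previous script alive in its default resource container, so the five-layer script's B1 (shape [200]) collides with a leftover variable of the same auto-generated name Variable_1 (shape [10]) from the softmax script, which is exactly the failing Assign above. A sketch of a workaround I believe should work, assuming TensorFlow 1.x: reset the server's containers before building the new graph, instead of restarting the server process.

import tensorflow as tf

target = "grpc://HOSTNAME:2222"

# Clear all state (variables, queues) kept in the server's default
# resource container, so stale variables from an earlier script cannot
# collide with the new graph's auto-named variables.
tf.Session.reset(target)

# ... build the model as usual, then:
with tf.Session(target) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run the training loop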