tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation

Question

kaenkogashi opened this issue 4 years ago · comments

Dear sir,

Thank you for your work!

I am trying to train VCL on V-COCO by following the instructions:

Train a VCL on V-COCO
python tools/Train_VCL_ResNet_VCOCO.py --model VCL_union_multi_ml1_l05_t3_rew_aug5_3_new_VCOCO_test --num_iteration 400000

I assigned only 1 GPU for training and got the error messages below. Would you help me solve this?
I also don't understand why the error is about HICO when I am trying to train on V-COCO.

Traceback (most recent call last):
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1339, in _run_fn
self._extend_graph()
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1374, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation HICO_0/MatMul: {{node HICO_0/MatMul}}was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:CPU:1, /job:localhost/replica:0/task:0/device:CPU:10, /job:localhost/replica:0/task:0/device:CPU:11, /job:localhost/replica:0/task:0/device:CPU:12, /job:localhost/replica:0/task:0/device:CPU:13, /job:localhost/replica:0/task:0/device:CPU:14, /job:localhost/replica:0/task:0/device:CPU:15, /job:localhost/replica:0/task:0/device:CPU:2, /job:localhost/replica:0/task:0/device:CPU:3, /job:localhost/replica:0/task:0/device:CPU:4, /job:localhost/replica:0/task:0/device:CPU:5, /job:localhost/replica:0/task:0/device:CPU:6, /job:localhost/replica:0/task:0/device:CPU:7, /job:localhost/replica:0/task:0/device:CPU:8, /job:localhost/replica:0/task:0/device:CPU:9, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
[[HICO_0/MatMul]]

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tools/Train_VCL_ResNet_VCOCO.py", line 109, in
sw.train_model(sess, args.max_iters)
File "/home/kogashi/VCL/tools/../lib/models/train_Solver_VCOCO_MultiGPU.py", line 153, in train_model
self.from_snapshot(sess)
File "/home/kogashi/VCL/tools/../lib/models/train_Solver_VCOCO.py", line 134, in from_snapshot
sess.run(tf.global_variables_initializer())
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation HICO_0/MatMul: node HICO_0/MatMul (defined at /home/kogashi/VCL/tools/../lib/networks/ResNet50_VCOCO.py:150) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:CPU:1, /job:localhost/replica:0/task:0/device:CPU:10, /job:localhost/replica:0/task:0/device:CPU:11, /job:localhost/replica:0/task:0/device:CPU:12, /job:localhost/replica:0/task:0/device:CPU:13, /job:localhost/replica:0/task:0/device:CPU:14, /job:localhost/replica:0/task:0/device:CPU:15, /job:localhost/replica:0/task:0/device:CPU:2, /job:localhost/replica:0/task:0/device:CPU:3, /job:localhost/replica:0/task:0/device:CPU:4, /job:localhost/replica:0/task:0/device:CPU:5, /job:localhost/replica:0/task:0/device:CPU:6, /job:localhost/replica:0/task:0/device:CPU:7, /job:localhost/replica:0/task:0/device:CPU:8, /job:localhost/replica:0/task:0/device:CPU:9, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
[[HICO_0/MatMul]]

Errors may have originated from an input operation.
Input Source operations connected to node HICO_0/MatMul:
IteratorGetNext (defined at /home/kogashi/VCL/tools/../lib/ult/ult.py:884)
HICO_0/Const (defined at /home/kogashi/VCL/tools/../lib/networks/ResNet50_VCOCO.py:148)

Zhi Hou · Answer 1 · Mon Sep 07 2020 19:54:12 GMT+0800 (China Standard Time)

Thanks for your interest.

The "HICO" in the message is my mistake. I first evaluated VCL on the HICO-DET dataset and did not rename the variable "HICO" to "HOI". It is just a scope/variable name.

According to your log, I guess this is because your GPU device is named "XLA_GPU". The available devices are:

 [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:CPU:1, /job:localhost/replica:0/task:0/device:CPU:10, /job:localhost/replica:0/task:0/device:CPU:11, /job:localhost/replica:0/task:0/device:CPU:12, /job:localhost/replica:0/task:0/device:CPU:13, /job:localhost/replica:0/task:0/device:CPU:14, /job:localhost/replica:0/task:0/device:CPU:15, /job:localhost/replica:0/task:0/device:CPU:2, /job:localhost/replica:0/task:0/device:CPU:3, /job:localhost/replica:0/task:0/device:CPU:4, /job:localhost/replica:0/task:0/device:CPU:5, /job:localhost/replica:0/task:0/device:CPU:6, /job:localhost/replica:0/task:0/device:CPU:7, /job:localhost/replica:0/task:0/device:CPU:8, /job:localhost/replica:0/task:0/device:CPU:9, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]

Thus, your device list contains no device "/device:GPU:0". I guess changing the device name to "/device:XLA_GPU:0" might solve your problem.

Change line 66 in lib/models/train_Solver_VCOCO_MultiGPU.py from

with tf.device('/gpu:%d' % gpu_idx):

to

with tf.device('/XLA_GPU:%d' % gpu_idx):

or
with tf.device('/device:XLA_GPU:%d' % gpu_idx):

You can find this issue reported in the TensorFlow repository.

It is also OK to remove all "tf.device()" calls in lib/models/train_Solver_VCOCO_MultiGPU.py if you use just one GPU. TensorFlow will then use the default device.
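To illustrate how the error message above can be read, here is a minimal standard-library Python sketch that extracts the "available devices" list from the message and checks whether a regular GPU device appears in it. The helper names and the shortened SAMPLE message are hypothetical and are not part of VCL or TensorFlow. When only XLA_GPU is listed, as in this log, it often suggests that TensorFlow did not register the CUDA GPU device at all.

```python
import re


def available_devices(error_message):
    """Return (device_type, index) pairs from the 'available devices are [...]' list."""
    match = re.search(r"available devices are \[(.*?)\]", error_message, re.DOTALL)
    if match is None:
        return []
    # Entries look like /job:localhost/replica:0/task:0/device:CPU:0
    pairs = re.findall(r"/device:([A-Za-z_]+):(\d+)", match.group(1))
    return [(kind, int(index)) for kind, index in pairs]


def has_regular_gpu(error_message):
    """True if a plain GPU device (not only XLA_GPU) is listed as available."""
    return any(kind == "GPU" for kind, _ in available_devices(error_message))


# Shortened, hypothetical version of the message quoted in this thread.
SAMPLE = (
    "Cannot assign a device for operation HICO_0/MatMul: node HICO_0/MatMul "
    "was explicitly assigned to /device:GPU:0 but available devices are "
    "[ /job:localhost/replica:0/task:0/device:CPU:0, "
    "/job:localhost/replica:0/task:0/device:XLA_CPU:0, "
    "/job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. "
    "Make sure the device specification refers to a valid device."
)
```

Calling has_regular_gpu(SAMPLE) returns False here, because the list contains only CPU, XLA_CPU and XLA_GPU entries. The requested "/device:GPU:0" appears earlier in the message but not in the available list.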

If you have further problems, feel free to discuss them.

kaenkogashi · Answer 2 · Tue Sep 08 2020 18:37:19 GMT+0800 (China Standard Time)

Thank you for your reply!

I removed all "tf.device()" calls in lib/models/train_Solver_VCOCO_MultiGPU.py because I use one GPU. (I also tried the other solution, changing line 66 in lib/models/train_Solver_VCOCO_MultiGPU.py from
with tf.device('/gpu:%d' % gpu_idx):
to
with tf.device('/XLA_GPU:%d' % gpu_idx):
or
with tf.device('/device:XLA_GPU:%d' % gpu_idx):
but there were still GPU allocation errors.)

But this time the GPU was not used. The training ran on the CPU instead.
I installed the tensorflow-gpu version with pip.
I don't know what to do. Sorry for the basic questions; I am not familiar with TensorFlow. (I always use PyTorch.)
If you come up with any other solutions, please let me know. Thank you very much!

Zhi Hou · Answer 3 · Tue Sep 08 2020 19:13:48 GMT+0800 (China Standard Time)

I also met a similar problem, but I have forgotten the solution. Someone solved it like this in tensorflow/tensorflow#30748 (comment):

I met the same problem on ubuntu 18.04, cuda 10.1 and Tensorflow 1.14.0. However, I uninstalled the pip version tensorflow using pip uninstall tensorflow-gpu and then use conda install -c anaconda tensorflow-gpu to install conda version, and it works for me. You can have a try.

Hope this helps.

Zhi Hou · Answer 4 · Tue Sep 08 2020 19:28:38 GMT+0800 (China Standard Time)

By the way, did you also remove "tf.device('/cpu:0')" in line 44? If so, your TensorFlow installation possibly has some problems. Try installing tensorflow-gpu==1.14.0 with conda.

kaenkogashi · Answer 5 · Wed Sep 09 2020 09:08:49 GMT+0800 (China Standard Time)

Thank you for your help!

I uninstalled the pip version and installed tensorflow-gpu==1.14.0 with conda. I used the original code without any changes.
Then a CUDA "not found" error came up.

tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at xla_ops.cc:463 : Not found: ./libdevice.compute_20.10.bc not found
Traceback (most recent call last):
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: ./libdevice.compute_20.10.bc not found
[[{{node cluster_4_1/xla_compile}}]]
[[cluster_1_1/merge_oidx_4/_873]]
(1) Not found: ./libdevice.compute_20.10.bc not found
[[{{node cluster_4_1/xla_compile}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tools/Train_VCL_ResNet_VCOCO.py", line 109, in
sw.train_model(sess, args.max_iters)
File "/home/kogashi/VCL/tools/../lib/models/train_Solver_VCOCO_MultiGPU.py", line 171, in train_model
train_op])
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: ./libdevice.compute_20.10.bc not found
[[{{node cluster_4_1/xla_compile}}]]
[[cluster_1_1/merge_oidx_4/_873]]
(1) Not found: ./libdevice.compute_20.10.bc not found
[[{{node cluster_4_1/xla_compile}}]]
0 successful operations.
0 derived errors ignored.

I renamed /home/kogashi/miniconda3/cuda-10.1/nvvm/libdevice/libdevice.10.bc
to
/home/kogashi/miniconda3/cuda-10.1/nvvm/libdevice/libdevice.compute_20.10.bc
but there is still a NotFoundError. Where is TensorFlow looking?
I searched several websites but still cannot find the answer. (I am using cuda-10.1. The current error is a NotFoundError, but if cuda-10.1 is the wrong version, please let me know.)
Thank you very much!

Zhi Hou · Answer 6 · Wed Sep 09 2020 09:55:07 GMT+0800 (China Standard Time)

Do you have multiple CUDA installations? Have you defined the CUDA_DIR environment variable? From the discussion in google/jax#989, it seems that TensorFlow cannot find the CUDA directory. People there solved it by setting CUDA_DIR, by adding a symlink (e.g. $ sudo ln -s /opt/cuda /usr/local/cuda-10.2), or by setting "XLA_FLAGS=--xla_gpu_cuda_data_dir=conda-env-path/lib/".
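A rough Python sketch of the XLA_FLAGS option, for illustration only. It assumes the libdevice files sit under nvvm/libdevice inside the CUDA directory, as in the path quoted earlier in this thread. The helper names and the candidate directories are hypothetical; adjust them to your machine. The environment variable presumably has to be set before TensorFlow is imported for it to take effect.

```python
import os


def find_cuda_data_dir(candidates):
    """Return the first candidate that contains nvvm/libdevice, or None."""
    for root in candidates:
        if not root:
            continue  # skip empty entries, e.g. an unset CUDA_DIR
        if os.path.isdir(os.path.join(root, "nvvm", "libdevice")):
            return root
    return None


def xla_flags_for(cuda_dir):
    """Build the XLA_FLAGS value that points XLA at the given CUDA directory."""
    return "--xla_gpu_cuda_data_dir=" + cuda_dir


def configure_xla(candidates):
    """Set XLA_FLAGS if a CUDA directory is found; call this before importing tensorflow."""
    cuda_dir = find_cuda_data_dir(candidates)
    if cuda_dir is not None:
        os.environ["XLA_FLAGS"] = xla_flags_for(cuda_dir)
    return cuda_dir


# Hypothetical search order; none of these paths is guaranteed to exist.
DEFAULT_CANDIDATES = [
    os.environ.get("CUDA_DIR", ""),
    "/usr/local/cuda",
    "/usr/local/cuda-10.1",
]
```

Calling configure_xla(DEFAULT_CANDIDATES) at the very top of the training script would set the flag only when one of the directories actually contains libdevice. It returns None otherwise, so a missing CUDA directory is easy to notice.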

kaenkogashi · Answer 7 · Wed Sep 09 2020 12:03:47 GMT+0800 (China Standard Time)

Thank you very much! After I created a symbolic link to the CUDA directory, it finally worked!
Would you tell me how many hours this model takes to train on the V-COCO and HICO datasets?
Also, on my GPU your model does not seem to use much GPU power, but it needs a lot of memory and CPU power.

Zhi Hou · Answer 8 · Wed Sep 09 2020 12:45:21 GMT+0800 (China Standard Time)

V-COCO converges at around iteration 300000. HICO converges at around iteration 500000. The time the model takes depends on your GPU.

V-COCO needs less than 24 hours on a 2080Ti and HICO requires around 48 hours. If you decrease the learning rate on V-COCO more quickly, I guess it will converge earlier.

On a 2080Ti, each iteration takes about 0.2 s on HICO. On a Titan XP, each iteration takes around 0.25-0.3 s on HICO. If your training speed is still too slow after 1000 iterations, I guess something is wrong.

Yeah, it needs GPU memory because we input two images.

All numbers above are based on the res50 backbone.

kaenkogashi · Answer 9 · Wed Sep 09 2020 14:54:58 GMT+0800 (China Standard Time)

Thank you for your reply!

Currently I am training on the V-COCO dataset. It is slow: it still takes 2.107 s per iteration after 1000 iterations.
My GPU has 16 GB of memory (Tesla V100). So, shall I use multiple GPUs rather than a single GPU?

iter: 4910 / 400000, im_id: 347655, total loss: nan, lr: 0.010000, speed: 2.107 s/iter/iter

Is the estimate below based on a single GPU? (I thought you probably used multiple GPUs.)

V-COCO needs less than 24 hours on 2080Ti and HICO requires around 48 hours.
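For a sense of scale, the log line quoted above can be turned into a remaining-time estimate. This is a minimal sketch: the regular expression assumes the exact log format shown in this comment, and the function names are hypothetical rather than part of VCL.

```python
import math
import re

# Matches lines like:
# iter: 4910 / 400000, im_id: 347655, total loss: nan, lr: 0.010000, speed: 2.107 s/iter/iter
LOG_PATTERN = re.compile(
    r"iter:\s*(\d+)\s*/\s*(\d+).*?total loss:\s*([^,]+),.*?speed:\s*([\d.]+)\s*s/iter"
)


def parse_log_line(line):
    """Extract iteration counters, loss and speed from one training log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized log line: %r" % line)
    return {
        "iteration": int(match.group(1)),
        "total_iterations": int(match.group(2)),
        "loss": float(match.group(3)),  # float("nan") yields NaN
        "seconds_per_iter": float(match.group(4)),
    }


def remaining_hours(record):
    """Estimate the remaining training time in hours at the logged speed."""
    left = record["total_iterations"] - record["iteration"]
    return left * record["seconds_per_iter"] / 3600.0


def loss_is_nan(record):
    """True if the logged total loss is NaN."""
    return math.isnan(record["loss"])
```

For the quoted line, remaining_hours gives roughly 231 hours (395090 iterations at 2.107 s each), and loss_is_nan returns True because the line reports a total loss of nan. For comparison, 400000 iterations at 0.2 s each would take about 22 hours.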

Zhi Hou · Answer 10 · Wed Sep 09 2020 15:46:56 GMT+0800 (China Standard Time)

Yes, all the experiments are based on a single GPU, because I found that training with two GPUs has bugs and is slower. I also tested the code for issue #4 on a V100 last week. Here (https://github.com/zhihou7/VCL/files/5175383/test.txt) is the log. It is much faster than the experiment with the 2080Ti.

Did you install scikit-image?

scikit-image 0.14.2

I remember that the version of scikit-image affects the speed seriously. I use 0.14.2.

It might also be because the first run is slow. It is weird.

kaenkogashi · Answer 11 · Wed Sep 09 2020 16:40:36 GMT+0800 (China Standard Time)

Thank you for your comment!

I had scikit-image installed, but the version was different, so I uninstalled the old one and installed scikit-image 0.14.2.
I am restarting the training and will let you know the result later. Thank you very much!

Zhi Hou · Answer 12 · Wed Sep 09 2020 16:47:50 GMT+0800 (China Standard Time)

In fact, I do not use scikit-image in my code. I just forgot to remove "import skimage". I'm not sure why the code runs slowly in some environments.

kaenkogashi · Answer 13 · Thu Sep 10 2020 10:53:48 GMT+0800 (China Standard Time)

Thank you for your comment!

I found the reason for the slow training: other people are using the CPU heavily on our server,
and the VCL program also uses the CPU heavily. That was the reason.

iter: 4910 / 400000, im_id: 347655, total loss: nan, lr: 0.010000, speed: 2.107 s/iter/iter

Zhi Hou · Answer 14 · Thu Sep 10 2020 11:04:48 GMT+0800 (China Standard Time)

Thanks for your comment! I have also seen VCL use a lot of CPU power on some machines, while on other machines it looks normal. In our GPU cluster, I usually allocate 1 GPU and 4 CPUs, and VCL runs normally. It might also depend on the IO load.
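A quick way to check whether a shared server is CPU-bound before blaming the training code is to compare the load average with the number of cores. The sketch below uses only the Python standard library. The 0.8 threshold is an arbitrary assumption, the function names are hypothetical, and os.getloadavg is available only on Unix-like systems.

```python
import os


def cpu_contended(load_average, cpu_count, threshold=0.8):
    """True if the load average exceeds threshold * cpu_count."""
    return load_average > threshold * cpu_count


def server_is_busy(threshold=0.8):
    """Check the current 1-minute load average against the core count (Unix only)."""
    one_minute_load = os.getloadavg()[0]
    return cpu_contended(one_minute_load, os.cpu_count() or 1, threshold)
```

On a 16-core machine such as the one in the device list earlier in this thread, a load average of 15 would count as contended under this threshold, while a load of 2 would not.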

kaenkogashi · Answer 15 · Thu Sep 10 2020 12:04:04 GMT+0800 (China Standard Time)

Unfortunately, our server's CPU is always busy. Maybe I could load the data into GPU memory instead of CPU memory?
I will write a PyTorch implementation of VCL! (But I need to read the whole code first, haha.)
Anyway, I think we can close this topic. Thank you for your help!

Zhi Hou · Answer 16 · Thu Sep 10 2020 12:13:16 GMT+0800 (China Standard Time)

OK, I also want to implement this in PyTorch, but I have not found suitable open-source PyTorch code, or I cannot reproduce the reported performance with it. Our core code is in VCL.py. Current implementation is worse.

kaenkogashi · Answer 17 · Thu Sep 10 2020 12:51:00 GMT+0800 (China Standard Time)

I see. After I finish the PyTorch implementation (probably not based on iCAN), I will contact you. But from what you say, it may be hard work, haha!

kaenkogashi · Answer 18 · Thu Sep 10 2020 17:53:39 GMT+0800 (China Standard Time)

@zhihou7

Current implementation is worse.

I looked at the code. Some components come from TensorFlow libraries (like the Res5 blocks), and PyTorch does not have them. Maybe this is the reason why we can't reproduce the performance.

Would you provide a PyTorch version of your core code VCL.py or of other modules?
I am going to implement VCL. Even if your PyTorch code cannot reproduce the performance, starting from it would still be faster for me than writing from scratch.
I hope to hear from you soon!

Zhi Hou · Answer 19 · Thu Sep 10 2020 18:03:47 GMT+0800 (China Standard Time)

Well, I have not begun to implement VCL in PyTorch. I plan to implement it based on https://github.com/ASMIftekhar/VSGNet or https://github.com/vt-vl-lab/DRG. DRG is based on iCAN, which is the base code of my released code. If we simply want to reimplement VCL in PyTorch, DRG (appearance-only branch) is possibly a good choice. But DRG is an ensemble of three models (very weird). I can only obtain around 12% mAP with the appearance-only model in DRG, far worse than reported.

kaenkogashi · Answer 20 · Thu Sep 10 2020 18:12:10 GMT+0800 (China Standard Time)

Thank you! I have some urgent tasks today, so I will look at the code you mentioned tomorrow!