RuntimeError: Tensors must be CUDA and dense
JoohyungLee0106 opened this issue
Chris (이주형) commented
Context
- PyTorch version: 1.8.2 (LTS) and 1.12.0. Both versions produce the same error message.
- Operating System and version: Ubuntu 18.04.4 LTS
- Python: 3.7.13
- CUDA: 10.2
- CUDNN: 7.6.5
- GPU: TITAN V * 8
Your Environment
- Installed using source? [yes/no]: no, installed with conda
- Are you planning to deploy it using docker container? [yes/no]: no. I am NOT using any container
- Is it a CPU or GPU environment?: GPU (TITAN V * 8), CUDA 10.2, CUDNN 7.6.5
- Which example are you using:
https://github.com/pytorch/examples/blob/main/imagenet/main.py
- Link to code or data to repro [if any]:
LINK
Expected Behavior
The script should run to completion without raising an error.
Current Behavior
It fails with:
RuntimeError: Tensors must be CUDA and dense
when validation reaches the all_reduce call at
https://github.com/pytorch/examples/blob/main/imagenet/main.py#L419
Possible Solution
Steps to Reproduce
- Edit https://github.com/pytorch/examples/blob/main/imagenet/main.py to train on CIFAR-10 instead of ImageNet [LINK]
- Run
python main.py --dist-url 'tcp://xxx.xxx.xxx.xx:23456' --multiprocessing-distributed --world-size 1 --rank 0
...
Failure Logs [if any]
Traceback (most recent call last):
  File "main_spawn.py", line 490, in <module>
    main()
  File "main_spawn.py", line 119, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/chris/codes/speed_check_ddp/main_spawn.py", line 265, in main_worker
    acc1 = validate(val_loader, model, criterion, args)
  File "/home/chris/codes/speed_check_ddp/main_spawn.py", line 376, in validate
    top1.all_reduce()
  File "/home/chris/codes/speed_check_ddp/main_spawn.py", line 425, in all_reduce
    dist.all_reduce(total, dist.ReduceOp.SUM, async_op=True)
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1169, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
Chris (이주형) commented
Solved by #1031.
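For anyone landing here with the same message: the traceback ends in dist.all_reduce, and the NCCL backend raises this error when it is handed a tensor that lives on the CPU. The snippet below is not the change from #1031, only a minimal sketch of the general pattern of building the tensor on the GPU before reducing it. The helper name reduce_sum_and_count and its parameters are made up for illustration, and it assumes each process has already selected its GPU with torch.cuda.set_device().

```python
import torch
import torch.distributed as dist


def reduce_sum_and_count(total_sum, count, device=None):
    """Sum (total_sum, count) across ranks and return (sum, count, avg).

    Hypothetical helper for illustration; not taken from the example script.
    """
    if device is None:
        # Assumption: torch.cuda.set_device() was already called in this
        # process, so "cuda" resolves to the GPU assigned to this rank.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Create the tensor directly on the target device. A tensor created
    # without a device argument lives on the CPU, which NCCL rejects.
    total = torch.tensor([total_sum, count], dtype=torch.float32, device=device)

    if dist.is_available() and dist.is_initialized():
        # Synchronous reduction, so the result is ready when it is read below.
        dist.all_reduce(total, op=dist.ReduceOp.SUM)

    reduced_sum, reduced_count = total.tolist()
    return reduced_sum, reduced_count, reduced_sum / reduced_count
```

When no process group is initialized the helper simply returns the local values, which makes it easy to try on a single machine. Note also that the failing call in the traceback uses async_op=True; with an asynchronous call the returned work handle has to be waited on (handle.wait()) before the tensor is read, so the sketch uses the default synchronous call instead.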