pytorch / examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.

Home Page: https://pytorch.org/examples


RuntimeError: Tensors must be CUDA and dense

JoohyungLee0106 opened this issue


Context

  • PyTorch version: tested with 1.8.2 (LTS) and 1.12.0. Both versions produce the same error message
  • Operating System and version: Ubuntu 18.04.4 LTS
  • Python: 3.7.13
  • CUDA: 10.2
  • CUDNN: 7.6.5
  • GPU: TITAN V * 8

Your Environment

  • Installed using source? [yes/no]: no (installed with conda)
  • Are you planning to deploy it using docker container? [yes/no]: no, I am not using any container
  • Is it a CPU or GPU environment?: GPU (TITAN V * 8), CUDA 10.2, CUDNN 7.6.5
  • Which example are you using: https://github.com/pytorch/examples/blob/main/imagenet/main.py
  • Link to code or data to repro [if any]:
    LINK

Expected Behavior

The script should run to completion without raising an error.

Current Behavior

It outputs:

RuntimeError: Tensors must be CUDA and dense

when execution reaches this line:
https://github.com/pytorch/examples/blob/main/imagenet/main.py#L419

Possible Solution

Steps to Reproduce

  1. Edit https://github.com/pytorch/examples/blob/main/imagenet/main.py to train on CIFAR-10 [LINK]
  2. Run
 python main.py --dist-url 'tcp://xxx.xxx.xxx.xx:23456' --multiprocessing-distributed --world-size 1 --rank 0

...

Failure Logs [if any]

Traceback (most recent call last):
  File "main_spawn.py", line 490, in <module>
    main()
  File "main_spawn.py", line 119, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/chris/codes/speed_check_ddp/main_spawn.py", line 265, in main_worker
    acc1 = validate(val_loader, model, criterion, args)
  File "/home/chris/codes/speed_check_ddp/main_spawn.py", line 376, in validate
    top1.all_reduce()
  File "/home/chris/codes/speed_check_ddp/main_spawn.py", line 425, in all_reduce
    dist.all_reduce(total, dist.ReduceOp.SUM, async_op=True)
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1169, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
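
A note on the traceback above: the error is raised inside dist.all_reduce, and the NCCL backend only accepts CUDA tensors. One likely cause, assuming the process group was initialized with backend "nccl" and that `total` was created on the CPU, is that the meter's tensor never gets moved to the GPU. The following is a minimal sketch of a meter that builds the tensor on a device the backend can reduce on. The helper name `reduction_device` and this trimmed-down `AverageMeter` are illustrative only and are not copied from the example script. The sketch also uses a synchronous all_reduce, because with async_op=True the result should not be read until the returned work handle has been waited on.

```python
import torch
import torch.distributed as dist


def reduction_device() -> torch.device:
    """Pick a device the active backend can reduce on (hypothetical helper)."""
    if dist.is_available() and dist.is_initialized() and dist.get_backend() == "nccl":
        # NCCL only accepts CUDA tensors. This assumes the worker has already
        # called torch.cuda.set_device(...) for its own GPU.
        return torch.device("cuda", torch.cuda.current_device())
    return torch.device("cpu")


class AverageMeter:
    """Trimmed-down running average, for illustration only."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val, n=1):
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def all_reduce(self):
        # Create the tensor directly on the reduction device instead of the CPU.
        total = torch.tensor(
            [self.sum, self.count], dtype=torch.float32, device=reduction_device()
        )
        if dist.is_available() and dist.is_initialized():
            # Synchronous call: `total` holds the summed values when it returns.
            dist.all_reduce(total, op=dist.ReduceOp.SUM)
        self.sum, self.count = total.tolist()
        self.avg = self.sum / self.count if self.count else 0.0
```

When no process group is initialized, the reduction is skipped and the meter keeps its local values, which makes the sketch easy to try on a single CPU process before running it under mp.spawn.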