Megvii-BaseDetection / cvpods

All-in-one Toolbox for Computer Vision Research.

Home Page: https://cvpods.readthedocs.io

How to run training with a single gpu

Dmytro-Shvetsov opened this issue

I am trying to launch training of any of the YOLOF models. However, when I run
pods_train --num-gpus 1 --num-machines 1
I get the following error:

Traceback (most recent call last):
  File "/cyclists/lib/YOLOF/tools/train_net.py", line 109, in <module>
    args=(args,),
  File "/cyclists/lib/YOLOF/cvpods/engine/launch.py", line 56, in launch
    main_func(*args)
  File "/cyclists/lib/YOLOF/tools/train_net.py", line 95, in main
    runner.train()
  File "/cyclists/lib/YOLOF/cvpods/engine/runner.py", line 270, in train
    super().train(self.start_iter, self.start_epoch, self.max_iter)
  File "/cyclists/lib/YOLOF/cvpods/engine/base_runner.py", line 84, in train
    self.run_step()
  File "/cyclists/lib/YOLOF/cvpods/engine/base_runner.py", line 185, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "../yolof_base/yolof.py", line 134, in forward
    pred_logits, pred_anchor_deltas)
  File "../yolof_base/yolof.py", line 210, in losses
    dist.all_reduce(num_foreground)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 935, in all_reduce
    _check_default_pg()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

Could you tell me what I am doing wrong?
My setup is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:00:10.0 Off |                  N/A |
|  0%   46C    P8     8W / 180W |     20MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

CUDA 10.1

commented

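If you want to run a job on a single GPU, please make sure that the distributed parts of the code are handled properly. In the traceback above, the failing call is dist.all_reduce(num_foreground) in the losses function of yolof.py, which requires an initialized default process group.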
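
One possible way to handle the distributed call is sketched below. It is not taken from the cvpods or YOLOF source. The helper name all_reduce_if_initialized is hypothetical, and the sketch assumes the only problem is that dist.all_reduce is called when no process group exists. It uses torch.distributed.is_available() and torch.distributed.is_initialized() to skip the reduction in a single-process run:

```python
import torch
import torch.distributed as dist


def all_reduce_if_initialized(tensor: torch.Tensor) -> torch.Tensor:
    """Sum `tensor` across processes, or leave it unchanged in a single process.

    Hypothetical helper for illustration; not part of cvpods or YOLOF.
    """
    # is_available() is False when PyTorch was built without distributed support.
    # is_initialized() is False when init_process_group() was never called,
    # which is the situation behind "Default process group is not initialized".
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor)  # in-place sum over all ranks
    return tensor


# Single-process run: no process group exists, so the value is returned as-is.
num_foreground = torch.tensor(12.0)
num_foreground = all_reduce_if_initialized(num_foreground)
print(num_foreground.item())
```

With a guard like this, the call in the losses function would become a no-op on one GPU. Any later division by the world size would need the same check, because torch.distributed.get_world_size() also requires an initialized process group.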