ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite

Home Page: https://docs.ultralytics.com


Multi-GPU Training 🌟

NanoCode012 opened this issue · comments

📚 This guide explains how to properly use multiple GPUs to train a dataset with YOLOv5 🚀 on single or multiple machine(s). UPDATED 25 December 2022.

Before You Start

Clone repo and install requirements.txt in a Python>=3.7.0 environment, including PyTorch>=1.7. Models and datasets download automatically from the latest YOLOv5 release.

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

💡 ProTip! Docker Image is recommended for all Multi-GPU trainings. See Docker Quickstart Guide.
💡 ProTip! torch.distributed.run replaces torch.distributed.launch in PyTorch>=1.9. See docs for details.
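
For example, the DDP training command from this guide only changes the launcher module between the two PyTorch versions:

# PyTorch < 1.9
$ python -m torch.distributed.launch --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1
# PyTorch >= 1.9
$ python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1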

Training

Select a pretrained model to start training from. Here we select YOLOv5s, the smallest and fastest model available. See our README table for a full comparison of all models. We will train this model with Multi-GPU on the COCO dataset.

YOLOv5 Models

Single GPU

$ python train.py  --batch 64 --data coco.yaml --weights yolov5s.pt --device 0

Multi-GPU DataParallel Mode (⚠️ not recommended)

You can pass additional GPUs in --device to use Multiple GPUs in DataParallel mode.

$ python train.py  --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1

This method is slow and barely speeds up training compared to using just 1 GPU.

Multi-GPU DistributedDataParallel Mode (✅ recommended)

You will have to pass python -m torch.distributed.run --nproc_per_node, followed by the usual arguments.

$ python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1

--nproc_per_node specifies how many GPUs you would like to use. In the example above, it is 2.
--batch is the total batch size. It will be divided evenly across the GPUs. In the example above, it is 64/2=32 per GPU.

The code above will use GPUs 0... (N-1).

Use specific GPUs (click to expand)

You can do so by simply passing --device followed by your specific GPUs. For example, in the code below, we will use GPUs 2,3.

$ python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --device 2,3
Use SyncBatchNorm (click to expand)

SyncBatchNorm could increase accuracy for multi-GPU training; however, it will slow down training by a significant factor. It is only available for Multi-GPU DistributedDataParallel training.

It is best used when the batch-size on each GPU is small (<= 8).

To use SyncBatchNorm, simply pass --sync-bn to the command, like below:

$ python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --sync-bn
Use Multiple machines (click to expand)

This is only available for Multiple GPU DistributedDataParallel training.

Before we continue, make sure the files on all machines are the same: dataset, codebase, etc. Afterwards, make sure the machines can communicate with each other.

You will have to choose a master machine (the machine that the others will talk to). Note down its address (master_addr) and choose a port (master_port). I will use master_addr = 192.168.1.1 and master_port = 1234 for the example below.

To use it, run the following:

# On master machine 0
$ python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank 0 --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''
# On machine R
$ python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank R --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''

where G is the number of GPUs per machine, N is the number of machines, and R is the machine number from 0...(N-1).
Let's say I have two machines with two GPUs each; it would be G = 2, N = 2, and R = 1 for the second machine in the above.
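
Filled in for that two-machine, two-GPU case (using the same master_addr and master_port as above), the launch commands would be:

# On the master machine (node_rank 0)
$ python -m torch.distributed.run --nproc_per_node 2 --nnodes 2 --node_rank 0 --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''
# On the second machine (node_rank 1)
$ python -m torch.distributed.run --nproc_per_node 2 --nnodes 2 --node_rank 1 --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''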

Training will not start until all N machines are connected. Output will only be shown on master machine!

Notes

  • Windows support is untested, Linux is recommended.
  • --batch must be a multiple of the number of GPUs.
  • GPU 0 will take slightly more memory than the other GPUs as it maintains EMA and is responsible for checkpointing etc.
  • If you get RuntimeError: Address already in use, it could be because you are running multiple trainings at a time. To fix this, simply use a different port number by adding --master_port like below,
$ python -m torch.distributed.run --master_port 1234 --nproc_per_node 2 ...

Results

DDP profiling results on an AWS EC2 P4d instance with 8x A100 SXM4-40GB for YOLOv5l for 1 COCO epoch.

Profiling code
# prepare
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all -v "$(pwd)"/coco:/usr/src/coco $t
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
cd .. && rm -rf app && git clone https://github.com/ultralytics/yolov5 -b master app && cd app
cp data/coco.yaml data/coco_profile.yaml

# profile
python train.py --batch-size 16 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0 
python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 32 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1   
python -m torch.distributed.run --nproc_per_node 4 train.py --batch-size 64 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1,2,3  
python -m torch.distributed.run --nproc_per_node 8 train.py --batch-size 128 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1,2,3,4,5,6,7
| GPUs (A100) | batch-size | CUDA_mem device0 (GB) | COCO train | COCO val |
|-------------|------------|-----------------------|------------|----------|
| 1x          | 16         | 26GB                  | 20:39      | 0:55     |
| 2x          | 32         | 26GB                  | 11:43      | 0:57     |
| 4x          | 64         | 26GB                  | 5:57       | 0:55     |
| 8x          | 128        | 26GB                  | 3:09       | 0:57     |

FAQ

If an error occurs, please read the checklist below first! (It could save you time.)

Checklist (click to expand)
  • Have you properly read this post?
  • Have you tried to reclone the codebase? The code changes daily.
  • Have you tried to search for your error? Someone may have already encountered it in this repo or in another and have the solution.
  • Have you installed all the requirements listed on top (including the correct Python and PyTorch versions)?
  • Have you tried in other environments listed in the "Environments" section below?
  • Have you tried with another dataset like coco128 or coco2017? It will make it easier to find the root cause.

If you went through all the above, feel free to raise an Issue by giving as much detail as possible following the template.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

Credits

I would like to thank @MagicFrogSJTU, who did all the heavy lifting, and @glenn-jocher for guiding us along the way.

There will be multiple/redundant outputs. It does not affect training. This is a WIP.

I suggest we use 'will be fixed in the future' instead of 'WIP'. Many probably don't know what WIP means.
By the way, explain all the abbreviations. We must assume users know nothing!

Multiple GPUs DistributedDataParallel Mode (Recommended!!)

I suggest we should explicitly make it clear that DDP is faster than DP. Use this title:

Multiple GPUs DistributedDataParallel Mode (Faster than DP, Recommended!!)

The tutorial is excellent! Good job!

Traceback (most recent call last):
  File "train.py", line 482, in <module>
    train(hyp, tb_writer, opt, device)
  File "train.py", line 130, in train
    with torch_distributed_zero_first(local_rank):
  File "/home/amax/anaconda3/envs/pytorch/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/amax/objectdetection/yolov5/utils/utils.py", line 41, in torch_distributed_zero_first
    torch.distributed.barrier()
  File "/home/amax/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8

(the same traceback is repeated once for each process)
Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device1 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device2 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device3 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device4 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device5 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device6 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device7 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)

Traceback (most recent call last):
  File "train.py", line 468, in <module>
    dist.init_process_group(backend='nccl', init_method='env://')  # distributed backend
  File "/home/amax/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/amax/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
  File "/home/amax/anaconda3/envs/pytorch/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/amax/anaconda3/envs/pytorch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/amax/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/amax/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/amax/anaconda3/envs/pytorch/bin/python', '-u', 'train.py', '--local_rank=7', '--batch-size', '64', '--data', 'data/7classes.yaml', '--cfg', 'models/yolov5s.yaml', '--weights', '']' returned non-zero exit status 1.

store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

Hello @feizhouxiaozhu, I think this may be because you are running multiple trainings at a time, and they are communicating over the same port. To fix this, you can run on a different port.
Using the example from above, add --master_port ####, where #### is a random port number.

$ python -m torch.distributed.launch --master_port 42342 --nproc_per_node 2 ... 

Please tell me if this fixed the problem. If it doesn't, can you tell us how to replicate this problem?

Hmm, I'm not sure why that is. @feizhouxiaozhu , could you try to re-clone the repo then try again?

If error still occurs, could you try to run on coco128? Run the code below in terminal.

cd yolov5
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv -n ./coco128 ../
export PYTHONPATH="$PWD"
python -m torch.distributed.launch --master_port 9990 --nproc_per_node 2 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320

I'm currently running 8 GPU DDP custom data training, and there is no issue.

Edit: Reply was removed. @feizhouxiaozhu , is the problem solved?

Excellent guide guys, thank you so much! I was training on a DGX1 and was wondering why there wasn't much of a speed difference.

@cesarandreslopez oh wow, lucky you. Are you seeing faster speeds now with the updated multi gpu training?

@glenn-jocher in DataParallel mode, every epoch with about 51,000 images on yolov5l.yaml was taking about six and a half minutes on the DGX1.

In DistributedDataParallel mode with SyncBatchNorm I am seeing about 3 minutes and 10 seconds, so quite an improvement.

I've seen no improvement in Testing speed.

On @NanoCode012's guide there is this note:

--batch-size is now the Total batch-size. It will be divided evenly to each GPU. In the example above, it is 64/2=32 per GPU.

Based on that I assumed that the batch size could be something like --batch 1024 (128 per GPU), but I kept getting CUDA out of memory after an epoch completed and testing started, so I eventually just went with --batch 128.

Apparent GPU use during training and testing.

During training, GPU 0 seems to have considerably higher RAM use than the other GPUs (which limits the batch size to roughly what a single GPU could handle). The processing itself seems distributed across all GPUs.

image

GPU consumption during testing looks like this, where GPU 0 has very high memory use but doesn't seem to process, while the other 7 GPUs seem busy with the amount of memory expected for a batch of that size:

image

Our training set for this example is about 51,000 images and our testing sample is about 5,100. Testing takes about four and a half minutes; a training epoch takes about 3 minutes and 10 seconds.

Given the amount of time this spends on testing I am wondering if it is possible or even useful to set testing every n epochs. We are currently studying up on this repository and will understand it enough soon to be able to offer PRs.

@glenn-jocher Happy to provide you remote access to the machine for your tests and so on. It's the least we can do! Just PM me.

Hi @cesarandreslopez , nice numbers!

The reason GPU 0 has higher memory use is that it has to communicate with the other GPUs to coordinate. In my test, however, I don't see as vast a difference in GPU memory as you do. The latest one is 31 GB (GPU 0) vs 20 GB (others). Maybe SyncBN is increasing GPU load, or the dataloaders for testing(?).

Batch size is indeed divided evenly. Is it possible to run a batch size of 128 on a single GPU? That is quite large for yolov5l.

Testing is done on only 1 GPU (GPU 0 tests, the other GPUs continue training), so that may be why you experience slow testing times. Using multiple GPUs there is currently being worked on.

It is an interesting concept to test every n epochs and can certainly be done. However, maybe randomness will cause you to miss the “best” epoch, so I’m not sure if it’s good.

Edit: If you would like to do so, it's on line 339 in train.py; add an (epoch % interval == 0) condition, e.g. as sketched below.
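
A minimal sketch of that kind of condition (illustrative only, not the exact train.py code; the function name and the interval value of 5 are hypothetical):

# Illustrative sketch: evaluate mAP only every `interval` epochs, and always on the final epoch
def should_test(epoch: int, final_epoch: bool, notest: bool, interval: int = 5) -> bool:
    return final_epoch or (not notest and epoch % interval == 0)

# With interval=5 over 12 epochs, mAP runs on epochs 0, 5, 10 and the final epoch 11
print([e for e in range(12) if should_test(e, final_epoch=(e == 11), notest=False)])  # [0, 5, 10, 11]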

Edit2: How is speed without SyncBN? Since the individual batch size is around 128/8 = 16 > 8, I'm not sure if accuracy would be affected.

Edit3: If you have multiple machines you want to run this training on, there is an experimental PR you could try.

@cesarandreslopez ok got it, thanks for the feedback. I think I know why your testing is CUDA OOM. Before the DDP updates, train.py and test.py shared the same batch size (default 32); it seems likely this is still the case, except that test.py is inheriting the global batch size instead of the local batch size. So I suspect you should be able to train with much larger batch sizes once this bug is fixed. @NanoCode012 does that make sense about the global vs local batch sizes being passed to test.py?

Testing every n epochs is a good idea. You can currently use python train.py --notest to train without testing until the very final epoch, but we don't have a middle ground. Testing may not benefit as much from multi-GPU compared to training, because NMS ops run sequentially rather than in parallel and tend to dominate testing time. An alternative to testing every n epochs is simply to supply a higher --conf-thres to test at. The default is 0.001; perhaps setting it to 0.01 will halve your testing time.

That's a very generous offer! I'm pretty busy these days so I can't take you up on it immediately, but I'll keep that in mind in the future, thank you! It would definitely be nice to have access to something like that.

@glenn-jocher, I just noticed that! That may be why the memory is so different. But now it's a question of optimization. For a small "total batch size", it makes sense to pass in the entire thing. For a large total, it doesn't.

I think one easy solution is to let the user pass in one argument, --test-total, to test with their total batch size vs their divided batch size. But it can get confusing for newcomers.

Edit: What do you think?

@NanoCode012 if we replace total_batch_size with batch_size on L194:

yolov5/train.py

Lines 191 to 196 in fd532d9

# Testloader
if rank in [-1, 0]:
    # local_rank is set to -1. Because only the first process is expected to do evaluation.
    testloader = create_dataloader(test_path, imgsz_test, total_batch_size, gs, opt, hyp=hyp, augment=False,
                                   cache=opt.cache_images, rect=True, local_rank=-1, world_size=opt.world_size)[0]

And on L341, would that solve @cesarandreslopez's issue about testing OOM?

yolov5/train.py

Lines 339 to 348 in fd532d9

if not opt.notest or final_epoch:  # Calculate mAP
    results, maps, times = test.test(opt.data,
                                     batch_size=total_batch_size,
                                     imgsz=imgsz_test,
                                     save_json=final_epoch and opt.data.endswith(os.sep + 'coco.yaml'),
                                     model=ema.ema.module if hasattr(ema.ema, 'module') else ema.ema,
                                     single_cls=opt.single_cls,
                                     dataloader=testloader,
                                     save_dir=log_dir)

If we do so, testing could take num_gpu times longer. (I remember training/testing with a total batch size of 16 on coco taking 1 h.)

I think giving the user an option is good, but we should set testing to use the total batch size by default. Only when the user hits OOM should they configure it. Does --notest-total sound good?

@NanoCode012 ok got it. I think the most common use case is for users to maximize training CUDA memory, so since test.py is currently restricted to single-GPU it would make sense to default it to batch_size rather than total_batch_size. But I suppose we should wait for @MagicFrogSJTU's work on test.py before really modifying this, since it will get a makeover shortly. I think it's best to try to simplify the options when possible so it 'just works', as Steve Jobs would say, so let's avoid adding extra arguments if possible.

@cesarandreslopez I think for the time being you could apply the L194 and L341 fix described above, we have a few more significant PRs due in the coming week, so a more permanent fix for this should be included in those.

@NanoCode012 does that make sense about the global vs local batch sizes being passed to test.py?

@glenn-jocher
After my fix, train.py will run the test in parallel, and the global_batch_size will be split into smaller local_batch_size at test time, just like at training time. Problem solved.

@glenn-jocher please note that when --notest is used on the current master branch it will crash after completing the first epoch.

Traceback (most recent call last):
  File "train.py", line 469, in <module>
    train(hyp, tb_writer, opt, device)
  File "train.py", line 371, in train
    with open(results_file, 'r') as f:  # create checkpoint
FileNotFoundError: [Errno 2] No such file or directory: 'runs/exp0/results.txt'

I tried doing a touch results.txt under the runs/expN/ folder, which avoids the error above, but then a new one appears:

Traceback (most recent call last):
  File "train.py", line 469, in <module>
    train(hyp, tb_writer, opt, device)
  File "train.py", line 380, in train
    if (best_fitness == fi) and not final_epoch:
UnboundLocalError: local variable 'fi' referenced before assignment

So adding --notest to the command above will not work in yolov5 right now. (This did work with yolov3 in previous tests.)

Edit 1: @NanoCode012 if I follow your suggestion:

Edit: If you would like to do so, it’s on line 339 in Train.py, add a (epoch%interval==0) condition

The same error described here for --notest will appear.

@cesarandreslopez should be fixed following PR #518. Tested on single-GPU and CPU.


hi! @glenn-jocher for multi-gpu training, if using smaller batch size than 64, could you suggest the hyperparameter to adjust like the learning rate?

hi! @glenn-jocher for multi-gpu training, if using smaller batch size than 64, could you suggest the hyperparameter to adjust like the learning rate?

Internally, the effective batch size is kept at a minimum of 64: gradient accumulation is used if a batch size smaller than 64 is given. Therefore, no adjustment is needed if you use a smaller batch size.
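
A minimal sketch of that accumulation rule (illustrative only, not the exact train.py source; the function name is hypothetical, the nominal batch size of 64 is from the answer above):

# Illustrative sketch of gradient accumulation sizing
def grad_accumulate_steps(batch_size: int, nominal_batch_size: int = 64) -> int:
    # Number of batches to accumulate gradients over before each optimizer step
    return max(round(nominal_batch_size / batch_size), 1)

print(grad_accumulate_steps(16))  # 4 -> optimizer steps every 4 batches, effective batch size 64
print(grad_accumulate_steps(64))  # 1 -> no accumulation needed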

Hello, I have the following problem when using multi-GPU training, which is done according to your command line.
IMG_20200730_202118

#not working on multi-GPU training.

Hello @liumingjune, could you provide us the exact line you used?

EDIT: Also, did you use the latest repo? I think this can be the reason.

Hello @liumingjune, could you provide us the exact line you used? Looking at the screenshot, did you pass in --local_rank argument?

Thank you for your reply. My command line is

python -m torch.distributed.launch --nproc_per_node 4 train.py --device 0,1,2,3
I have 4 GPUs in total.

Hi @liumingjune , could you try to pull or clone the repo again? I saw that your hyp values are old, and train function is missing some arguments.

I ran

git clone https://github.com/ultralytics/yolov5.git && cd yolov5
python -m torch.distributed.launch --nproc_per_node 4 train.py --device 0,1,2,3

and there were no problems.

Hi @liumingjune , could you try to pull or clone the repo again? I saw that your hyp values are old, and train function is missing some arguments.

I ran

git clone https://github.com/ultralytics/yolov5.git && cd yolov5
python -m torch.distributed.launch --nproc_per_node 4 train.py --device 0,1,2,3

and there were no problems.

OK, I will try. Maybe that's the reason. My version is a clone of YOLOv5 from when it first appeared. Thanks a lot!

Hello, I want to know the difference between the current version and the previously released version, because I find that the form of dataset preparation is different. The previous one required preparing the dataset path plus separate training and validation file lists, so I had to manually split out the training data and the validation data. This is not friendly to large data volumes.

@liumingjune I don't know exactly what you're referring to, but the full change history is available here https://github.com/ultralytics/yolov5/commits/master


Well, I got the same problem as @feizhouxiaozhu if I set --nproc_per_node to 6 or 8; 2 or 4 is OK.

python3 -m torch.distributed.launch --master_port 9999 --nproc_per_node 8 train.py --batch-size 256 --data data/shape.yaml --cfg models/yolov5x.yaml --weights ' '

subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--batch-size', '256', '--data', 'data/shape.yaml', '--cfg', 'models/yolov5x.yaml', '--weights', '']' returned non-zero exit status 1.

well, i got the same problem with @feizhouxiaozhu if I set
`--nproc_per_node 6 or 8 ',
2 or 4 is OK.

python3 -m torch.distributed.launch --master_port 9999 --nproc_per_node 8 train.py --batch-size 256 --data data/shape.yaml --cfg models/yolov5x.yaml --weights ' '

subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--batch-size', '256', '--data', 'data/shape.yaml', '--cfg', 'models/yolov5x.yaml', '--weights', '']' returned non-zero exit status 1.

Hi @Frank1126lin , could you tell me where/when that error occured?

I ran the below(your code but on coco2017) on a new clone, and there were no issues till training epoch 1. Did you try to reclone?

python3 -m torch.distributed.launch --master_port 9999 --nproc_per_node 8 train.py --batch-size 256 --data data/coco.yaml --cfg models/yolov5x.yaml --weights ' '

@feizhouxiaozhu 's error is most likely due to an old clone as stated. Proper DDP training was added not too long ago.



well, i got the same problem with @feizhouxiaozhu if I set
--nproc_per_node 6 or 8 ', 2 or 4 is OK. python3 -m torch.distributed.launch --master_port 9999 --nproc_per_node 8 train.py --batch-size 256 --data data/shape.yaml --cfg models/yolov5x.yaml --weights ' ' subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--batch-size', '256', '--data', 'data/shape.yaml', '--cfg', 'models/yolov5x.yaml', '--weights', '']' returned non-zero exit status 1.`

Hi @Frank1126lin , could you tell me where/when that error occured?

I ran the below(your code but on coco2017) on a new clone, and there were no issues till training epoch 1. Did you try to reclone?

python3 -m torch.distributed.launch --master_port 9999 --nproc_per_node 8 train.py --batch-size 256 --data data/coco.yaml --cfg models/yolov5x.yaml --weights ' '

@feizhouxiaozhu 's error is most likely due to an old clone as stated. Proper DDP training was added not too long ago.

OK. I will try. Maybe that's the reason. I will try. My version is a clone of Yolov5 when it first appeared.Thanks a lot!

OK, thank you so much for your answer. I will reclone this repo and try it again.
Another question: when I use --nproc_per_node 4, it seems to take almost the same time as single-GPU training, as in the run below.
python3 -m torch.distributed.launch --nproc_per_node 4 train.py --batch-size 256 --data data/shape.yaml --cfg yolov5x.yaml --weights '' --epochs 2400
Optimizer stripped from runs/exp2/weights/last.pt, 177.4MB
Optimizer stripped from runs/exp2/weights/best.pt, 177.4MB
2400 epochs completed in 2.726 hours.


I got the same problem as below:

root@:~/ai/yolov5-0818# python3 -m torch.distributed.launch --master_port 9999  --nproc_per_node 8 train.py  --batch-size 128 --data shape.yaml --cfg yolov5l.yaml --weights '' --device 0,1,2,3,4,5,6,7
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using CUDA device0 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device1 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device2 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device3 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device4 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device5 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device6 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device7 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='./models/yolov5l.yaml', data='./data/shape.yaml', device='0,1,2,3,4,5,6,7', epochs=300, evolve=False, global_rank=0, hyp='data/hyp.scratch.yaml', img_size=[640, 640], local_rank=0, logdir='runs/', multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, sync_bn=False, total_batch_size=128, weights='', workers=8, world_size=8)
Start Tensorboard with "tensorboard --logdir runs/", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mixup': 0.0}

                 from  n    params  module                                  arguments                     
  0                -1  1      7040  models.common.Focus                     [3, 64, 3]                    
  1                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  2                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]                 
  3                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  4                -1  1   1627904  models.common.BottleneckCSP             [256, 256, 9]                 
  5                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  6                -1  1   6499840  models.common.BottleneckCSP             [512, 512, 9]                 
  7                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]             
  8                -1  1   2624512  models.common.SPP                       [1024, 1024, [5, 9, 13]]      
  9                -1  1  10234880  models.common.BottleneckCSP             [1024, 1024, 3, False]        
 10                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1]             
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1   2823680  models.common.BottleneckCSP             [1024, 512, 3, False]         
 14                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1    707328  models.common.BottleneckCSP             [512, 256, 3, False]          
 18                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1   2561536  models.common.BottleneckCSP             [512, 512, 3, False]          
 21                -1  1   2360320  models.common.Conv                      [512, 512, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1  10234880  models.common.BottleneckCSP             [1024, 1024, 3, False]        
 24      [17, 20, 23]  1     37695  models.yolo.Detect                      [2, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [256, 512, 1024]]
Model Summary: 335 layers, 4.73987e+07 parameters, 4.73987e+07 gradients
Optimizer groups: 110 .bias, 118 conv.weight, 107 other
Scanning labels ../shape/labels/train.cache (3 found, 0 missing, 0 empty, 0 duplicate, for 3 images): 3it [00:00, 5409.68it/s]
Scanning labels ../shape/labels/train.cache (3 found, 0 missing, 0 empty, 0 duplicate, for 3 images): 3it [00:00, 7362.73it/s]
Traceback (most recent call last):
  File "train.py", line 458, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 167, in train
    ema.updates = start_epoch * nb // accumulate  # set EMA updates
AttributeError: 'NoneType' object has no attribute 'updates'

(the same traceback is repeated, interleaved, by the other worker processes)

Analyzing anchors... anchors/target = 6.32, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 test
Using 3 dataloader workers
Starting training for 300 epochs...
Traceback (most recent call last):
  File "train.py", line 458, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 238, in train
    pbar = enumerate(dataloader)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 291, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 764, in __init__
    self._try_put_index()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 994, in _try_put_index
    index = self._next_index()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 357, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 208, in __iter__
    for idx in self.sampler:
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/distributed.py", line 80, in __iter__
    assert len(indices) == self.total_size
AssertionError
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--batch-size', '128', '--data', 'shape.yaml', '--cfg', 'yolov5l.yaml', '--weights', '', '--device', '0,1,2,3,4,5,6,7']' returned non-zero exit status 1.

Hi @Frank1126lin, regarding the ema error, I just created a fix for this at #775 and am waiting for review. I am not as sure about the second error. Can you replicate it on the coco dataset?

Edit: Also are there any errors in Single GPU mode?

For training time, I haven't done any test in a while, so I cannot say.


Hi @NanoCode012, following your #775 instructions, the ema.updates issue is gone. Thanks a lot.
And on single-GPU training, no issue.

Another warning comes up with the command below (training does proceed):

python3 -m torch.distributed.launch --nproc_per_node 4 train.py --batch-size 64 --data shape.yaml --cfg yolov5l.yaml --weights '' --device 0,1,2,3

Warning as below:

/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before 
`optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before 
`lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more 
details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)


With the coco128 dataset, training proceeds, but with the warning below:

/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

For my own dataset, the error below occurs:

root:~/ai/yolov5-0818# python3 -m torch.distributed.launch --nproc_per_node 8 train.py --batch-size 512 --data shape.yaml  --cfg shape5l.yaml --weights '' --epochs 1200 --adam
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using CUDA device0 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device1 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device2 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device3 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device4 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device5 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device6 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)
           device7 _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', total_memory=32510MB)

Namespace(adam=True, batch_size=64, bucket='', cache_images=False, cfg='./models/shape5l.yaml', data='./data/shape.yaml', device='', epochs=1200, evolve=False, global_rank=0, hyp='data/hyp.scratch.yaml', img_size=[640, 640], local_rank=0, logdir='runs/', multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, sync_bn=False, total_batch_size=512, weights='', workers=8, world_size=8)
Start Tensorboard with "tensorboard --logdir runs/", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mixup': 0.0}

                 from  n    params  module                                  arguments                     
  0                -1  1      7040  models.common.Focus                     [3, 64, 3]                    
  1                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  2                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]                 
  3                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  4                -1  1   1627904  models.common.BottleneckCSP             [256, 256, 9]                 
  5                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  6                -1  1   6499840  models.common.BottleneckCSP             [512, 512, 9]                 
  7                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]             
  8                -1  1   2624512  models.common.SPP                       [1024, 1024, [5, 9, 13]]      
  9                -1  1  10234880  models.common.BottleneckCSP             [1024, 1024, 3, False]        
 10                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1]             
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1   2823680  models.common.BottleneckCSP             [1024, 512, 3, False]         
 14                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1    707328  models.common.BottleneckCSP             [512, 256, 3, False]          
 18                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1   2561536  models.common.BottleneckCSP             [512, 512, 3, False]          
 21                -1  1   2360320  models.common.Conv                      [512, 512, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1  10234880  models.common.BottleneckCSP             [1024, 1024, 3, False]        
 24      [17, 20, 23]  1     37695  models.yolo.Detect                      [2, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [256, 512, 1024]]





Model Summary: 335 layers, 4.73987e+07 parameters, 4.73987e+07 gradients



Optimizer groups: 110 .bias, 118 conv.weight, 107 other
Scanning labels ../shape/labels/train.cache (3 found, 0 missing, 0 empty, 0 duplicate, for 3 images): 3it [00:00, 5444.79it/s]
Scanning labels ../shape/labels/train.cache (3 found, 0 missing, 0 empty, 0 duplicate, for 3 images): 3it [00:00, 7186.13it/s]
Traceback (most recent call last):
  File "train.py", line 459, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 239, in train
    pbar = enumerate(dataloader)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 291, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 764, in __init__
    self._try_put_index()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 994, in _try_put_index
    index = self._next_index()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 357, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 208, in __iter__
    for idx in self.sampler:
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/distributed.py", line 80, in __iter__
    assert len(indices) == self.total_size
AssertionError

(the same traceback is repeated, interleaved, by the other worker processes)

Analyzing anchors... anchors/target = 6.40, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 test
Using 3 dataloader workers
Starting training for 1200 epochs...
Traceback (most recent call last):
  File "train.py", line 459, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 239, in train
    pbar = enumerate(dataloader)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 291, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 764, in __init__
    self._try_put_index()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 994, in _try_put_index
    index = self._next_index()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 357, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 208, in __iter__
    for idx in self.sampler:
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/distributed.py", line 80, in __iter__
    assert len(indices) == self.total_size
AssertionError
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--batch-size', '512', '--data', 'shape.yaml', '--cfg', 'shape5l.yaml', '--weights', '', '--epochs', '1200', '--adam']' returned non-zero exit status 1.

Hi @Frank1126lin, I have never gotten that error before. If coco128 works but your custom dataset doesn't, then we can clearly see the problem.

A quick search got me here: open-mmlab/mmdetection#1223.

I then noticed you used only 3 pictures, but 8 GPUs? How is that logical 😕? Although it might be a good idea to add a check that the number of images is divisible by the number of GPUs.

@NanoCode012 there's code in the create_dataloader() function in datasets.py to reduce batch-size to the length of the dataset when people try to use stupidly small datasets. In this case you probably want an assert to produce a hard error on someone trying to use more GPUs or nodes than images in their dataset.

batch_size = min(batch_size, len(dataset))
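
A hedged sketch of such an assert (the function name and message wording are hypothetical, placed alongside the clamp above rather than copied from the repo):

# Hypothetical guard: hard error when there are fewer images than DDP processes/GPUs
def check_dataset_size(num_images: int, world_size: int) -> None:
    assert num_images >= world_size, (
        f'ERROR: your dataset size is {num_images}, but you are using {world_size} GPUs. '
        f'Reduce GPUs to {num_images} or increase dataset size to >= {world_size}.')

check_dataset_size(num_images=3, world_size=8)  # raises AssertionError for the 3-image / 8-GPU case above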


Hi @Frank1126lin , I have never gotten that error before. If coco128 works, but your custom dataset doesn't. Then we clearly see the problem.

A quick search got me here open-mmlab/mmdetection#1223 .

I then noticed you used only 3 pictures but 8 GPUs? How is that logical? Although it might be a good idea to add a check that the number of images is divisible by the number of GPUs.

OK, I got that. Thanks a lot. I was just testing, so I didn't notice that difference. I will change the dataset and try again.
Sorry about that again. As for me, I am just a beginner at DL, and I like yolov5 so much.
So, shall we add some notice for beginners like me, just like @glenn-jocher said?

@Frank1126lin is right, we want to add some more error checking for this particular eventuality.

If there's one thing I've learned about making open source code, it's that people will find ways to break what you wrote that you never considered, so abundant error checking seems to be a must. Even if you thought people would use your code for a, b and c, they will also use it for d, e and f that you never considered, so it's best to have all the bases covered with good explanatory asserts to help guide usage in the right direction, i.e. "ERROR: your dataset size is 3, but you are using 8 GPUs. Reduce GPUs to 3 or increase dataset size >= 8."

Do you have a branch/fork of the main repo that is collecting all these changes?

If I try to run with --device=0,1 without using the torch.distributed.launch, I consistently receive an error from a convolution:

RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

even though I never encounter this when running on a single GPU. I thought maybe this thread would fix that, but I get all sorts of errors when trying it. There are a lot of suggested edits in the previous posts, so a curated branch would be best for developing this. I didn't see one.

Also, if anyone knows a simple fix for my particular problem, I'd love to hear it!!


I trained the yolov5l model on my own dataset for about 1200 epochs. It took 11.4 hours on my machine with 8x Tesla V100. I also used python -m torch.distributed.launch --nproc_per_node 8 ... Are there any other ways to accelerate this process?
2020-08-27-09-18-03-星期四


And I got this warning at the beginning of my training:

Using 5 dataloader workers
Starting training for 3800 epochs...
Epoch gpu_mem GIoU obj cls total targets img_size
0/3799 7.06G 0.02126 0.006547 0.0009961 0.0288 46 640: 0%| | 0/1 [00:23<?, ?it/s]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

It seems to be something to do with a PyTorch warning.
The weights from the completed training work well.

@Frank1126lin make sure you are using the latest version of PyTorch and this repo. In your screenshot you are not using all available CUDA memory; using it (i.e. a larger batch size) would accelerate training.


I trained the yolov5l model on my own dataset for about 1200 epochs. It took 11.4 hours on my machine with 8x Tesla V100. I also used python -m torch.distributed.launch --nproc_per_node 8 ... Are there any other ways to accelerate this process?
2020-08-27-09-18-03-星期四

I know you are new to DL, but 11 hours is very fast; for my 10000+ image dataset it takes over 100 hours with 4x 1080 Ti.
Raise your batch size until your GPU memory is fully used; you only used 7.84G of your GPU memory.
BTW: are you using Python 3.6? Python 3.8 is recommended.

@wudashuo right. YOLOv5x COCO trainings take about 10-12 days on a single V100. Another trick is that single GPU training is always more efficient, i.e. you can train 8 models with 1 GPU each in parallel faster than 8x GPU trainings in series. This allows for the most efficient parallelized hyperparameter searches etc.


@glenn-jocher hi glenn, thanks for your amazing work! I have a question about DDP.
When I used
python -m torch.distributed.launch --nproc_per_node 4 train.py --batch-size 64 --data mydata.yaml --cfg yolov5l.yaml --weights yolov5l.pt
on one machine (1080Ti * 4), it took about 4m10s per epoch.
Yesterday I tried using two machines (1080Ti * 4 per machine), so I doubled the batch size. All 8 GPUs were fully used, but it took longer than training on one machine (5m36s per epoch).
on machine 0:
python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr "192.168.1.18" --master_port 1234 train.py --batch-size 128 --data mydata.yaml --cfg yolov5l.yaml --weights yolov5l.pt --epochs 2000 --notest
on machine 1:
python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 1 --master_addr "192.168.1.18" --master_port 1234 train.py --batch-size 128 --data mydata.yaml --cfg yolov5l.yaml --weights yolov5l.pt --epochs 2000 --notest
this is machine 0 screenshot:
20200827110749
machine 1 didn't show anything, no training process:
20200827111030
training on one machine, 1min faster per epoch:
QQ20200827112728

I am wondering what went wrong? 8 GPUs can't be slower than 4 GPUs.

UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

Hi @Frank1126lin , I usually see this when using a small dataset that has a short epoch.

Hi @wudashuo ,

machine 1 didn't show anything, no training process:

Regarding output for multi-machine, it will only output on the master machine because it will be redundant to output on other machines. I will add this info to the guide.

Yesterday I was trying to use two machines(1080Ti * 4 per machine), so I doubled the batch size, all 8 GPUs are fully used, but it took longer than training on one machine(5m36s per epoch).

There will be overhead when using multiple machines compared to a single machine; however, it is strange that you don't get any speed improvement at all. I don't have two machines to test this, but from single-machine tests, 8 GPUs are faster than 4 GPUs. It would be great if you could find the underlying reason.

@glenn-jocher On a probably related note, I ran a few tests of yolov5s (5 epochs) at varying batch sizes on 2-GPU DDP and found the time taken to vary. An increase in batch size does not always decrease the time taken. Maybe there is some bottleneck?

Env: My jupyterlab docker at commit 5f07782 (For future ref)

| Batch-size | Epoch 0 | Epoch 1 | Epoch 2 |
| --- | --- | --- | --- |
| 64 | 9:45 | 6:43 | 6:54 |
| 128 | 9:54 | 5:54 | 5:47 |
| 256 | 9:01 | 6:18 | 6:23 |

I ran a full COCO run on yolov5s at varying batch size as well. (2GPU DDP) The graph is quite interesting. There is a ripple effect at larger batch sizes.

image

Result from test.py

| Batch-size | mAP 0.5 | mAP 0.5..0.95 |
| --- | --- | --- |
| 64 | 56.2 | 36.9 |
| 128 | 56.18 | 36.99 |
| 256 | 56.58 | 37.0 |

Hypothesis: could a bigger batch-size get better results for a model?

Edit: See below for updated table!

@NanoCode012 that's really interesting! This has been something I've been worried about, as I'm forced to use small batch sizes for the larger models when training. Yes, larger batches provide smoother batchnorm statistics and optimizer gradients, so it makes sense that they should result in better final models. The response vs epochs is really interesting also, with the oscillations. This may be due to EMA, as at bs 256 it's updating 4X less than at bs 64, but it's hard to tell.

The repo defaults to scaling the loss by batch size, so bs 256 has 2x the loss per optimizer update, but half as many optimizer updates.

I don't know why speed doesn't improve at bs 256 vs 128. Perhaps the dataloader is a bottleneck at the larger batch sizes, as 256x4 images then need to be loaded from the hard drive at the same time. Maybe larger batch sizes would benefit from more dataloader workers, or from caching the dataset with --cache. I can't say, as I don't have much multi-GPU experience unfortunately.

@NanoCode012 another point is that batch-size increases may provide much more speed improvement up to bs 64 and less afterwards: for example, at bs 16 there are 4 forward passes before each optimizer update, but at bs 64 and up there is always a single forward pass. So moving from bs 16 to 32 to 64 one should see significant speed gains; past that the benefit may plateau.
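As a rough illustration of the accumulation behaviour described above (64 being the nominal batch size; variable names are illustrative rather than the exact repo code):

nbs = 64  # nominal batch size
for bs in (16, 32, 64, 128, 256):
    accumulate = max(round(nbs / bs), 1)  # forward passes per optimizer step
    print(bs, accumulate)  # 16 -> 4, 32 -> 2, 64/128/256 -> 1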


@glenn-jocher @wudashuo @NanoCode012 Thank you all, guys. Yeah, I am using a small dataset (99 images) with a batch size of 512. I will try again later with all this advice.
What an amazing job you guys have done. Amazing magic...

@Frank1126lin a 99 image dataset (too small to produce useful results), logically can only be paired with a batch-size of up to 99.


@Frank1126lin a 99 image dataset (too small to produce useful results), logically can only be paired with a batch-size of up to 99.

I trained 1200 epochs with 99 images (2 classes), and the result seems powerful (mAP@0.5 is 0.923). I used batch-size 64 to train again yesterday with 3800 epochs; it is still in progress now.
As you mentioned above, it seems I used too many GPUs. ╮(╯_╰)╭

@Frank1126lin , FYI, if your per-GPU batch size (total batch size / number of GPUs) is small (generally <16), using SyncBatchNorm may give you some performance boost.

Edit: I’m not sure about its effect on small datasets though..


@Frank1126lin , FYI, if your per-GPU batch size (total batch size / number of GPUs) is small (generally <16), using SyncBatchNorm may give you some performance boost.

@NanoCode012 , OK, got you, thanks a lot. (゜▽^*))

Hi! I finished another test with varying batch sizes on yolov5m. I'll combine it with the earlier 5s table for comparison.
I also attempted to fine-tune the 5s model on the new hyps for 20 and 100 epochs.

Result

Command: python test.py --data coco.yaml --img 640 --conf 0.001

Env: Docker py36
5s trained and fine-tuned on commit: 5f07782
5m trained on commit: 0c01afc

| Model | Batch-size | mAP 0.5 | mAP 0.5..0.95 |
| --- | --- | --- | --- |
| 5s | 64 | 56.2 | 36.9 |
| 5s | 128 | 56.18 | 36.99 |
| 5s | 256 | 56.58 | 37.0 |
| 5m | 32 | 63.11 | 44.1 |
| 5m | 64 | 62.77 | 43.92 |
| 5m | 128 | 63.09 | 44.0 |
| 5s fine_20 | 256 | 55.21 | 36.02 |
| 5s fine_100 | 256 | 56.13 | 37.03 |

Conclusion

Since I trained 5s and 5m on two different commits (I'm questioning why I did so..), we cannot properly compare them against each other. However, we see that a larger batch size does not necessarily mean higher accuracy. It could just be a fluke that the earlier 5s at bs 256 scored higher than the single-GPU run. Is it odd that the 5m DDP run could not match your single-GPU training, falling short by a whole percentage point?

Fine-tuning the 5s does not seem to push it higher than the normal training. Maybe the new hyps aren't as good for COCO, or this idea isn't a good one?

For larger-batch training, maybe we can try LARS (Layer-wise Adaptive Rate Scaling).



@glenn-jocher in DataParallel mode, every epoch with about 51000 images on yolov5l.yaml was taking about 6 and a half minutes on the DGX1.

on DistributedDataParallel Mode with SyncBatchNorm I am seeing about 3 minutes and 10 seconds, so quite an improvement.

I've seen no improvement in Testing speed.

On @NanoCode012's guide there is this note:

--batch-size is now the Total batch-size. It will be divided evenly to each GPU. In the example above, it is 64/2=32 per GPU.

Based on that I assumed the batch size could be something like --batch 1024 (128 per GPU), but I kept getting CUDA out of memory once an epoch completed and testing started, so I eventually just went with --batch 128.

Apparent GPU use during training and testing.

During training, GPU 0 seems to have considerably higher RAM use than the other GPUs (which limits the batch size to around what a single GPU could handle). The processing itself seems distributed across all GPUs.

image

GPU consumption during testing looks like this, where GPU 0 has very high memory use but doesn't seem to process, while the other 7 GPUs seem busy with the amount of memory expected for a batch of that size:

image

Our training set for this example is about 51000 images and our testing sample is about 5100. Testing takes about four and a half minutes; a training epoch takes about 3 minutes and 10 seconds.

Given the amount of time this spends on testing I am wondering if it is possible or even useful to set testing every n epochs. We are currently studying up on this repository and will understand it enough soon to be able to offer PRs.

@glenn-jocher Happy to provide you remote access to the machine for your tests and so on. It's the least we can do! Just PM me.

Cool, which command did you use to show this image?


How do I use 4 GPUs?

Hi, @glenn-jocher. This is my situation: I have two RTX 2080 Ti GPUs, and GPU 0 has about 1 GB less free memory than GPU 1 because of display usage. In single-machine multi-GPU mode, GPU 0 will take more memory than the other GPUs, as you say. So can I map physical GPU 1 to logical GPU 0 to allow a bigger batch size?

@laisimiao you may be able to alter your device order by setting CUDA_VISIBLE_DEVICES=1,0 before your training command.
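For example (illustrative), prefixing the environment variable to the usual DDP command should make physical GPU 1 appear to PyTorch as cuda:0:

$ CUDA_VISIBLE_DEVICES=1,0 python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1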


@cesarandreslopez what software do you use to visualize the CPU+GPU usage image?

What is the best setup for 8x 32G V100s using torch.distributed.launch? With --batch-size 256 for yolov5s there is still memory headroom, but since GPU utilization is already near full, there wouldn't be much benefit from increasing the batch size further. Besides, GPU 0 is already overloaded. Do we need to adjust hyp.scratch.yaml with respect to batch size as well?

@serser hyperparameters are unaffected by batch size. You can start any DDP training using the existing tutorial above, nothing special is needed in your case.

My code can run, but when I use DDP it can't run all the epochs; sometimes it throws this error:
QQ图片20210312162303

and sometimes it throws this error:
image

I have two GPUs, and the program was running on Ubuntu 18.

@zzttqu it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments below.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

Hi,

I am trying to train the YOLOV5 model in my custom dataset in Azure ML platform with V100 Compute Cluster and 2 nodes are available but getting the following error:

train.py: error: unrecognized arguments: -m torch.distributed.launch --nproc_per_node 2

I am issuing the command for training as : python train.py -m torch.distributed.launch --nproc_per_node 2 --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5x.pt

Please help on this, thanks in advance!!


Hi,

I am trying to train the YOLOV5 model in my custom dataset in Azure ML platform with V100 Compute Cluster and 2 nodes are available but getting the following error:

train.py: error: unrecognized arguments: -m torch.distributed.launch --nproc_per_node 2

I am issuing the command for training as : python train.py -m torch.distributed.launch --nproc_per_node 2 --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5x.pt

Please help on this, thanks in advance!!

It is: python -m torch.distributed.launch --nproc_per_node 2 train.py
not: python train.py -m torch.distributed.launch --nproc_per_node 2

Hi,
I am trying to train the YOLOV5 model in my custom dataset in Azure ML platform with V100 Compute Cluster and 2 nodes are available but getting the following error:
train.py: error: unrecognized arguments: -m torch.distributed.launch --nproc_per_node 2
I am issuing the command for training as : python train.py -m torch.distributed.launch --nproc_per_node 2 --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5x.pt
Please help on this, thanks in advance!!

It is: python -m torch.distributed.launch --nproc_per_node 2 train.py
not: python train.py -m torch.distributed.launch --nproc_per_node 2

from azureml.core import ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      arguments=["-m", "torch.distributed.launch", "--nproc_per_node", 2, "--img", 640, "--batch", 16, "--epochs", 3, "--data", "coco128.yaml", "--weights", "yolov5x.pt"],
                      compute_target=compute_target,
                      distributed_job_config=MpiConfiguration(process_count_per_node=1, node_count=2),
                      environment=pytorch_env)

To run the script I have to pass the arguments like this, but can you give me some idea of how to put "-m", "torch.distributed.launch", "--nproc_per_node", 2 first and train.py after that in Azure ML?
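One possible approach (an untested sketch): newer versions of azureml-core let ScriptRunConfig take a command list instead of script + arguments, which would let the torch.distributed launcher wrap train.py. Whether this works depends on your SDK version:

# Sketch only: assumes your azureml-core version supports the `command` parameter.
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

src = ScriptRunConfig(source_directory=project_folder,
                      command=['python', '-m', 'torch.distributed.launch', '--nproc_per_node', '2',
                               'train.py', '--img', '640', '--batch', '16', '--epochs', '3',
                               '--data', 'coco128.yaml', '--weights', 'yolov5x.pt'],
                      compute_target=compute_target,
                      distributed_job_config=MpiConfiguration(process_count_per_node=1, node_count=2),
                      environment=pytorch_env)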

Now I am able to run the code, but I get the following error:

assert torch.cuda.device_count() > opt.local_rank
AssertionError

Can you give some idea how to resolve this issue??

Hi,

I am using the following command to train the model -

python -m torch.distributed.launch --nproc_per_node 2 train.py --img 1000 --batch 4 --epochs 6 --data data/coco128.yaml --cfg models/coco.yaml --weights yolov5x.pt

I am getting the following error -

File "train.py", line 564, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 95, in train
    with torch_distributed_zero_first(rank):
  File "/azureml-envs/azureml_77c1ca2b2e6464d1ceec5d148b2932b2/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/mnt/batch/tasks/shared/LS_root/jobs/amlws-arcb-ce-eu2-dev/azureml/yolov5_test_1616740686_6f76f5a0/mounts/workspaceblobstore/azureml/YOLOV5_Test_1616740686_6f76f5a0/utils/torch_utils.py", line 31, in torch_distributed_zero_first
    torch.distributed.barrier()
  File "/azureml-envs/azureml_77c1ca2b2e6464d1ceec5d148b2932b2/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 2419, in barrier
    default_pg = _get_default_group()
  File "/azureml-envs/azureml_77c1ca2b2e6464d1ceec5d148b2932b2/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Can anybody help me on this??

Thanks in advance!!

@R1234A your command is incorrect, --cfg, if supplied, must be a model yaml file. You've pointed it to a dataset yaml file. In any case though you don't need a model --cfg as you've already supplied pretrained --weights. Please see Multi-GPU tutorial to get started:

YOLOv5 Tutorials

Hi, when I use the multi-GPU option, the mAP of my model drops by several points. Do I need to increase the learning rate or make other changes when using multi-GPU? Waiting for your response!

@xyl3902596 you may want to apply --sync-bn when training multi-GPU to synchronize batchnorm layer statistics across CUDA devices:

python -m torch.distributed.launch --nproc_per_node 8 train.py --sync-bn

@NanoCode012 I'm looking into updating our DDP implementation to use mp.spawn as in this tutorial:
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Are you aware of any problems we might run into? I think first I'll try to reproduce the tutorial and then create a new ddp_spawn branch to try to reproduce our existing DDP performance using torch.distributed.launch. If everything goes well then we could use mp.spawn anytime more than one --device is passed to train.py. What do you think?

Hi @glenn-jocher ,

As I recall, mp.spawn is slower and performs worse than torch.distributed.launch. I'm not sure if it has changed over the past year, but past Issues say the same thing. (In fact, even pytorch-lightning still does not recommend it.)
pytorch/pytorch#47587
https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html#distributed-data-parallel-spawn

The thread below also suggests that multi-node training would be "less convenient" with mp.spawn.
https://discuss.pytorch.org/t/torch-distributed-launch-vs-torch-multiprocessing-spawn/95738

I just found out that torch.distributed.launch is deprecated. Pytorch recommends torch.distributed.run.
https://pytorch.org/docs/stable/elastic/run.html

How about following pytorch-lightning example and "launch"/"run" ourselves if --device > 1? I think it's not that complicated to replicate.
https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py

@NanoCode012 ok, good points, you've convinced me to avoid spawn! I think we should first migrate over to torch.distributed.run and then once it's working stably we can try working towards a wrapper that launches DDP itself on device count > 0.

EDIT: See PR #3680. PR tests well with torch.distributed.run, so I think we should merge this PR and then update this tutorial to require torch>=1.9.0 only for DDP (docker image is already on 1.9.0).

I am trying to train with multiple GPUs on a single machine with --sync-bn, but my training does not start and it freezes at 100% GPU usage until the process is killed.
image

I am using the following command
python -m torch.distributed.launch --nproc_per_node 2 train.py --img 1280 --cfg yolov5m.yaml --hyp hyp.scratch.yaml --batch 10 --epochs 300 --data custom_data.yaml --weights yolov5m6.pt --name m8_2_distributed_sync_on --device 0,1 --sync-bn
I have two GPUs RTX2080
Could you please let me know if I am doing something wrong?

Everything works great if I remove --sync-bn.

@malapatiravi yes, --sync-bn is currently a known issue with YOLOv5 and torch 1.9. You might try downgrading torch to an earlier version or simply train without the flag (none of the official models used --sync-bn for training).
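If you do go the downgrade route, a pip constraint along these lines (versions illustrative; pick builds that match your CUDA setup) keeps torch below 1.9:

$ pip install "torch<1.9" "torchvision<0.10"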

@glenn-jocher Thank you!

@malapatiravi yes, --sync-bn is currently a known issue with YOLOv5 and torch 1.9. You might try downgrading torch to an earlier version or simply train without the flag (none of the official models used --sync-bn for training).

Thank you @glenn-jocher

This is proof of Ultralytics' excellent talent, leadership, and great "public service".

Cheers,
Steve


I'm trying to train YOLO on an Azure node with 2 GPUs and I'm getting this error:


[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 64865 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=60000) ran for 64865 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 110275) of binary: /anaconda/envs/azureml_py38/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group

Any ideas how to fix this or what could be the issue?

@brunovollmer 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

at least 64

I actually have a similar question: is there a common rule for adjusting the learning rate given the number of GPUs and the batch size in a DDP scenario?

@slimwangyue LR is automatically adjusted to batch size and DDP settings, no action is required on your part.

@slimwangyue LR is automatically adjusted to batch size and DDP settings, no action is required on your part.

Thanks for your timely response. Could you further explain what change you made to automatically adjust LR depending on batch size and DDP settings? Thanks.

@slimwangyue the loss is automatically scaled with batch size and world size in loss.py and train.py.
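Conceptually, the scaling works like this (a paraphrased sketch, not the exact repo code):

loss = loss_per_image * batch_size  # scale loss by local batch size
if world_size > 1:                  # DDP averages gradients across ranks,
    loss = loss * world_size        # so rescale to keep gradient magnitude consistent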


Using python -m torch.distributed.launch --nproc_per_node 4 (later changed to torch.distributed.run), the command runs alright but hangs at "Destroying process group..." once training is done.

I went ahead and added the destroy_process_group() once training is done but now the process just hangs.

Looking at the GPU, the memory for the 1st GPU is still occupied. see the screenshot attached.

I am using 4 v100s and ran python -m torch.distributed.run --nproc_per_node 4 train.py --cfg models/yolov5x.yaml --weights yolov5x.pt --data data/data.yaml --hyp data/hyps/params.yaml --name fold_2_run_2_stage_1 --project runs/exp2 --epochs 30 --workers 16 --bbox_interval 1 --save-period 10 --cache --device 0,1,2,3 --batch 64

To get the terminal prompt I have to manually kill the process.

Been trying to figure it out but am not sure what the problem could be.

image

@mrdvince your code or your environment may be an issue, you should train DDP inside the Docker image with the latest YOLOv5 commit.

Hello, when I trained with 3 GPUs, I found that it was slower than one. Then I observed the training process and found that the time spent in optimizer.step() would gradually increase from about 0.35s in the beginning to about 2s. Do you know why?

command:
python -m torch.distributed.launch --nproc_per_node 3 train.py --device 0,1,2

@yunxi1 our official profiling results on reproducible hardware (P4d) and software (Ubuntu 20.04 Deep Learning AMI with YOLOv5 Docker image) are below:

Screenshot 2022-03-11 at 15 31 53

@glenn-jocher Thank you for your reply. I'll check the reasons that may lead to unsatisfactory results

Hello, how can I use DDP mode together with resume mode (--resume)?

Hmm, I'm not sure why that is. @feizhouxiaozhu , could you try to re-clone the repo then try again?

If error still occurs, could you try to run on coco128? Run the code below in terminal.

cd yolov5
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv -n ./coco128 ../
export PYTHONPATH="$PWD"
python -m torch.distributed.launch --master_port 9990 --nproc_per_node 2 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320

I'm currently running 8 GPU DDP custom data training, and there is no issue.

Edit: Reply was removed. @feizhouxiaozhu , is the problem solved?

When I use "python -m torch.distributed.launch --nproc_per_node 4 train.py --batch-size 8 --device 4,5,6,7 --sync-bn",report errors:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/zhuyu/anaconda3/envs/yolo5obb/bin/python', '-u', 'train.py', '--local_rank=3', '--batch-size', '8', '--device', '4,5,6,7', '--sync-bn']' returned non-zero exit status 1.

@LUO77123 command looks fine. Run all Multi-GPU trainings in Docker for best results.

@LUO77123 command looks fine. Run all Multi-GPU trainings in Docker for best results.

The other GPUs are occupied, and these (4,5,6,7) are free.

@LUO77123 command looks fine. Run all Multi-GPU trainings in Docker for best results.

I tried the following command (python -m torch.distributed.launch --nproc_per_node 4 --master_port 88888 train.py --epochs 70 --imgsz 1120 --batch-size 8 --sync-bn) in yolov5_master (v6.1). It runs, but the following message appears: "Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed". I don't know how to set it so that this message does not appear. Thanks!

As I said before use docker, this environment variable is set automatically there.

As I said before use docker, this environment variable is set automatically there.

Can I take it that this environment variable is set automatically in Docker, so I don't need to worry about that message ("Setting OMP_NUM_THREADS environment variable for each process to be 1 in default...")?
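(For reference, outside Docker the variable can also be set manually in front of the training command; the value below is only illustrative:)

$ OMP_NUM_THREADS=8 python -m torch.distributed.run --nproc_per_node 4 train.py --epochs 70 --imgsz 1120 --batch-size 8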

@glenn-jocher @NanoCode012 Thank you guys for the very nice work.

SyncBatchNorm could increase accuracy for multiple gpu training, however, it will slow down training by a significant factor. It is only available for Multiple GPU DistributedDataParallel training.

Can you give some numbers to show how much SyncBatchNorm will slow down training? For example, if I use 8 GPUs, roughly what would T_use_sbn / T_notuse_sbn be?

During training, GPU 0 seems to have considerably higher RAM use than the other GPUs (which limits the batch size to around what a single GPU could handle). The processing itself seems distributed across all GPUs.

image

GPU consumption during testing looks like this, where GPU 0 has very high memory use but doesn't seem to process, while the other 7 GPUs seem busy with the amount of memory expected for a batch of that size:

image

Excuse me, could you tell me how to get the GPU performance panel image?