ultralytics / yolov5

YOLOv5 πŸš€ in PyTorch > ONNX > CoreML > TFLite

Home Page: https://docs.ultralytics.com


Hyperparameter Evolution

glenn-jocher opened this issue Β· comments

πŸ“š This guide explains hyperparameter evolution for YOLOv5 πŸš€. Hyperparameter evolution is a method of hyperparameter optimization using a Genetic Algorithm (GA). UPDATED 28 March 2023.

Hyperparameters in ML control various aspects of training, and finding optimal values for them can be a challenge. Traditional methods like grid searches can quickly become intractable due to 1) the high-dimensional search space, 2) unknown correlations among the dimensions, and 3) the expensive nature of evaluating fitness at each point, making GAs a suitable candidate for hyperparameter searches.
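To illustrate the idea only, here is a generic genetic-search skeleton (this is not the YOLOv5 implementation; the evaluate step is a placeholder for an actual training run, and all values are made up):

import random

def evaluate(hyp):
    """Placeholder for the expensive step: train a model with `hyp` and return its fitness."""
    return random.random()

best_hyp = {'lr0': 0.01, 'momentum': 0.937}  # initial guess (illustrative values)
best_fit = evaluate(best_hyp)

for generation in range(300):
    # mutate the best member found so far by small random multiplicative gains
    child = {k: v * random.uniform(0.7, 1.3) for k, v in best_hyp.items()}
    fit = evaluate(child)
    if fit > best_fit:  # keep the child only if it improves fitness
        best_hyp, best_fit = child, fit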

Before You Start

Clone repo and install requirements.txt in a Python>=3.7.0 environment, including PyTorch>=1.7. Models and datasets download automatically from the latest YOLOv5 release.

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

1. Initialize Hyperparameters

YOLOv5 has about 30 hyperparameters used for various training settings. These are defined in *.yaml files in the /data directory. Better initial guesses will produce better final results, so it is important to initialize these values properly before evolving. If in doubt, simply use the default values, which are optimized for YOLOv5 COCO training from scratch.

# Hyperparameters for low-augmentation COCO training from scratch
# python train.py --batch 64 --cfg yolov5n6.yaml --weights '' --data coco.yaml --img 640 --epochs 300 --linear
# See tutorials for hyperparameter evolution https://github.com/ultralytics/yolov5#tutorials
lr0: 0.01 # initial learning rate (SGD=1E-2, Adam=1E-3)
lrf: 0.01 # final OneCycleLR learning rate (lr0 * lrf)
momentum: 0.937 # SGD momentum/Adam beta1
weight_decay: 0.0005 # optimizer weight decay 5e-4
warmup_epochs: 3.0 # warmup epochs (fractions ok)
warmup_momentum: 0.8 # warmup initial momentum
warmup_bias_lr: 0.1 # warmup initial bias lr
box: 0.05 # box loss gain
cls: 0.5 # cls loss gain
cls_pw: 1.0 # cls BCELoss positive_weight
obj: 1.0 # obj loss gain (scale with pixels)
obj_pw: 1.0 # obj BCELoss positive_weight
iou_t: 0.20 # IoU training threshold
anchor_t: 4.0 # anchor-multiple threshold
# anchors: 3 # anchors per output layer (0 to ignore)
fl_gamma: 0.0 # focal loss gamma (efficientDet default gamma=1.5)
hsv_h: 0.015 # image HSV-Hue augmentation (fraction)
hsv_s: 0.7 # image HSV-Saturation augmentation (fraction)
hsv_v: 0.4 # image HSV-Value augmentation (fraction)
degrees: 0.0 # image rotation (+/- deg)
translate: 0.1 # image translation (+/- fraction)
scale: 0.5 # image scale (+/- gain)
shear: 0.0 # image shear (+/- deg)
perspective: 0.0 # image perspective (+/- fraction), range 0-0.001
flipud: 0.0 # image flip up-down (probability)
fliplr: 0.5 # image flip left-right (probability)
mosaic: 1.0 # image mosaic (probability)
mixup: 0.0 # image mixup (probability)
copy_paste: 0.0 # segment copy-paste (probability)
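These values can also be inspected or modified programmatically. A minimal sketch, assuming PyYAML (already in requirements.txt) and that the file lives at data/hyps/hyp.scratch-low.yaml as in recent versions of the repo; adjust the path to your checkout:

import yaml  # PyYAML

with open('data/hyps/hyp.scratch-low.yaml', errors='ignore') as f:
    hyp = yaml.safe_load(f)  # dict of hyperparameter name -> value

print(hyp['lr0'], hyp['momentum'])  # 0.01 0.937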

2. Define Fitness

Fitness is the value we seek to maximize. In YOLOv5 we define a default fitness function as a weighted combination of metrics: mAP@0.5 contributes 10% of the weight and mAP@0.5:0.95 contributes the remaining 90%, with Precision P and Recall R absent. You may adjust these as you see fit or use the default fitness definition (recommended).

yolov5/utils/metrics.py

Lines 12 to 16 in 4103ce9

def fitness(x):
    # Model fitness as a weighted combination of metrics
    w = [0.0, 0.0, 0.1, 0.9]  # weights for [P, R, mAP@0.5, mAP@0.5:0.95]
    return (x[:, :4] * w).sum(1)
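As a quick sanity check, the same weighting can be applied to a hypothetical results row (the numbers below are illustrative, not from a real run):

import numpy as np

w = [0.0, 0.0, 0.1, 0.9]  # weights for [P, R, mAP@0.5, mAP@0.5:0.95]
results = np.array([[0.55, 0.56, 0.58, 0.34]])  # hypothetical [P, R, mAP@0.5, mAP@0.5:0.95]
print((results[:, :4] * w).sum(1))  # [0.364] = 0.1*0.58 + 0.9*0.34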

3. Evolve

Evolution is performed about a base scenario which we seek to improve upon. The base scenario in this example is finetuning COCO128 for 10 epochs using pretrained YOLOv5s. The base scenario training command is:

python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache

To evolve hyperparameters specific to this scenario, starting from our initial values defined in Section 1., and maximizing the fitness defined in Section 2., append --evolve:

# Single-GPU
python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache --evolve

# Multi-GPU
for i in 0 1 2 3 4 5 6 7; do
  sleep $(expr 30 \* $i) &&  # 30-second delay (optional)
  echo 'Starting GPU '$i'...' &&
  nohup python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache --device $i --evolve > evolve_gpu_$i.log &
done

# Multi-GPU bash-while (not recommended)
for i in 0 1 2 3 4 5 6 7; do
  sleep $(expr 30 \* $i) &&  # 30-second delay (optional)
  echo 'Starting GPU '$i'...' &&
  "$(while true; do nohup python train.py... --device $i --evolve 1 > evolve_gpu_$i.log; done)" &
done

The default evolution settings will run the base scenario 300 times, i.e. for 300 generations. You can modify generations via the --evolve argument, i.e. python train.py --evolve 1000.

yolov5/train.py

Line 608 in 6a3ee7c

for _ in range(opt.evolve): # generations to evolve

The main genetic operators are crossover and mutation. In this work mutation is used, with an 80% probability and a 0.04 variance, to create new offspring based on a combination of the best parents from all previous generations. Results are logged to runs/evolve/exp/evolve.csv, and the highest-fitness offspring is saved every generation as runs/evolve/hyp_evolved.yaml:

# YOLOv5 Hyperparameter Evolution Results
# Best generation: 287
# Last generation: 300
#    metrics/precision,       metrics/recall,      metrics/mAP_0.5, metrics/mAP_0.5:0.95,         val/box_loss,         val/obj_loss,         val/cls_loss
#              0.54634,              0.55625,              0.58201,              0.33665,             0.056451,             0.042892,             0.013441

lr0: 0.01  # initial learning rate (SGD=1E-2, Adam=1E-3)
lrf: 0.2  # final OneCycleLR learning rate (lr0 * lrf)
momentum: 0.937  # SGD momentum/Adam beta1
weight_decay: 0.0005  # optimizer weight decay 5e-4
warmup_epochs: 3.0  # warmup epochs (fractions ok)
warmup_momentum: 0.8  # warmup initial momentum
warmup_bias_lr: 0.1  # warmup initial bias lr
box: 0.05  # box loss gain
cls: 0.5  # cls loss gain
cls_pw: 1.0  # cls BCELoss positive_weight
obj: 1.0  # obj loss gain (scale with pixels)
obj_pw: 1.0  # obj BCELoss positive_weight
iou_t: 0.20  # IoU training threshold
anchor_t: 4.0  # anchor-multiple threshold
# anchors: 3  # anchors per output layer (0 to ignore)
fl_gamma: 0.0  # focal loss gamma (efficientDet default gamma=1.5)
hsv_h: 0.015  # image HSV-Hue augmentation (fraction)
hsv_s: 0.7  # image HSV-Saturation augmentation (fraction)
hsv_v: 0.4  # image HSV-Value augmentation (fraction)
degrees: 0.0  # image rotation (+/- deg)
translate: 0.1  # image translation (+/- fraction)
scale: 0.5  # image scale (+/- gain)
shear: 0.0  # image shear (+/- deg)
perspective: 0.0  # image perspective (+/- fraction), range 0-0.001
flipud: 0.0  # image flip up-down (probability)
fliplr: 0.5  # image flip left-right (probability)
mosaic: 1.0  # image mosaic (probability)
mixup: 0.0  # image mixup (probability)
copy_paste: 0.0  # segment copy-paste (probability)
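For intuition, here is a minimal sketch of the mutation step described above (simplified: the per-parameter gains from the meta dictionary are omitted, but the 80% probability and sigma 0.2, i.e. variance 0.04, match the implementation shown later in this thread):

import numpy as np

mp, s = 0.8, 0.2  # mutation probability, sigma
hyp = {'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005}  # illustrative subset

ng = len(hyp)
v = np.ones(ng)
while all(v == 1):  # mutate until at least one value changes (prevent duplicates)
    v = ((np.random.random(ng) < mp) * np.random.randn(ng) * np.random.random() * s + 1).clip(0.3, 3.0)

hyp = {k: float(val * v[i]) for i, (k, val) in enumerate(hyp.items())}  # multiplicative mutation
print(hyp)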

We recommend a minimum of 300 generations of evolution for best results. Note that evolution is generally expensive and time consuming, as the base scenario is trained hundreds of times, possibly requiring hundreds or thousands of GPU hours.

4. Visualize

evolve.csv is plotted as evolve.png by utils.plots.plot_evolve() after evolution finishes, with one subplot per hyperparameter showing fitness (y-axis) vs hyperparameter values (x-axis). Yellow indicates higher concentrations. Vertical distributions indicate that a parameter has been disabled and does not mutate. This is user-selectable in the meta dictionary in train.py, and is useful for fixing parameters and preventing them from evolving.

(Figure: evolve.png, one scatter subplot per hyperparameter plotting fitness against its values.)
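To re-plot results manually after (or during) a run, something like the following should work; a minimal sketch, assuming the default run directory and that utils.plots.plot_evolve accepts the CSV path:

from utils.plots import plot_evolve

plot_evolve('runs/evolve/exp/evolve.csv')  # writes evolve.png next to the CSV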

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher I trained with --evolve and --nosave, but I didn't receive the last weights in the runs folder.

commented

@buimanhlinh96, evolve will find the best hyperparameters after 10 epochs (if you didn't change it); you will want to take what was found to be best and do a full training!

commented

@glenn-jocher did you find 10 epochs to give a decent indication about a full training? Is that the most cost-efficient from what you have seen?

@Ownmarc there is no fixed evolve scenario. You create the scenario and then just append --evolve to it and let it work. If you want to evolve full training, well, you know what to do. Any assumption about results from shorter training correlating with results of longer trainings is up to you.

commented

And did you find any correlations between model sizes? Will some "best" hyp on yolov5s also do a good job on yolov5x, or would it require its own evolve?

@Ownmarc I have not evolved per model, but it's fairly obvious that whatever works best for a 7M parameter model will not be identical to whatever works best for a 90M parameter model.

commented

@glenn-jocher, I believe the v3.0 release changed train.py. I didn't find the hyp at L18-43 in train.py. Instead, I found the

data/hyp.scratch.yaml

file with the hyp set.
So, if I want to change the hyp for training, rewriting the hyp.scratch.yaml file is OK, right?

@Frank1126lin yes that's correct:

  • hyp.scratch.yaml will be automatically used by default
  • hyp.custom.yaml can be force-selected by python train.py --hyp hyp.custom.yaml

yolov5/train.py

Line 445 in c2523be

parser.add_argument('--hyp', type=str, default=ROOT / 'data/hyps/hyp.scratch.yaml', help='hyperparameters path')

@buimanhlinh96, hello, did you find the best hyp result after training with --evolve?

Hello, I run :
python train.py --epochs 10 --data coco128.yaml --cfg yolov5s.yaml --weights yolov5s.pt --cache --evolve

and got an error like:

Namespace(adam=False, batch_size=16, bucket='', cache_images=True, cfg='./models/yolov5s.yaml', data='./data/coco128.yaml', device='', epochs=3, evolve=True, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, logdir='runs/', multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, sync_bn=False, total_batch_size=16, weights='', workers=8, world_size=1)
Traceback (most recent call last):
  File "train.py", line 525, in <module>
    hyp[k] = max(hyp[k], v[1])  # lower limit
KeyError: 'anchors'


I also have the same problem.


Removing the 'anchors' line will solve the problem.


Thanks for helping. Could you explain more? Which anchors line (in train.py or in the yaml)?

I removed this line and it worked, but I need some explanation. Thanks again.

Constrain to limits

        for k, v in meta.items():
            hyp[k] = max(hyp[k], v[1])  # lower limit
            hyp[k] = min(hyp[k], v[2])  # upper limit
            hyp[k] = round(hyp[k], 5)  # significant digits


line 475 in train.py:
'iou_t': (0, 0.1, 0.7),  # IoU training threshold
'anchor_t': (1, 2.0, 8.0),  # anchor-multiple threshold
# 'anchors': (2, 2.0, 10.0),  # anchors per output grid (0 to ignore)
'fl_gamma': (0, 0.0, 2.0),  # focal loss gamma (efficientDet default gamma=1.5)
'hsv_h': (1, 0.0, 0.1),  # image HSV-Hue augmentation (fraction)


Many Thanks...

#'anchors

Will commenting out the anchors line affect the hyperparameters?

@Samjith888 autoanchor will create new anchors if a value is found for hyp['anchors'], overriding any anchor information you specify in your model.yaml. i.e. you can set anchors: 5 to force autoanchor to create 5 new anchors per output layer, replacing the existing anchors. Hyperparameter evolution will evolve you an optimal number of anchors using this parameter.

@glenn-jocher You mean that if you comment out ['anchors'] in the 'hyp.scratch.yaml' file, autoanchor will not work. Will yolov5_master then produce anchors based on the value of my model.yaml ['anchors']?
If we do not comment out ['anchors'] in the 'hyp.scratch.yaml' file, will autoanchor produce the specified number of anchors on each Detect layer?

@xinxin342 if a nonzero anchor hyperparameter is found, existing anchor information will be deleted and new anchors will be force-autocomputed.

Wow! This tutorial helps a lot! Many thanks!

Is there an argument to limit the maximum number of object detections per frame?

@Sergey-sib your question is not related to hyperparameter evolution.

max_det = 300 # maximum number of detections per image

Hi @glenn-jocher
The command:
for i in 0 1 2 3; do
nohup python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache --evolve --device $i &
done
It seems that for multi-GPU evolution, it just repeats the same process on 4 GPUs with the same parameters.

For example, I want to choose the best learning rate from the range (0.1, 0.5), and the number of generations is 200. My understanding is that it will try 200 different lr values to find the best one. Right?
So for multi-GPU evolution, if I use the command
for i in 0 1 2 3; do
nohup python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache --evolve --device $i &
done
It seems to just repeat it four times? I think it should be something like each GPU tries 50 different lr values in the ranges (0.1, 0.2), (0.2, 0.3), (0.3, 0.4) and (0.4, 0.5).

Please correct me if my understanding is not right. Thanks.

@cxzhou95 the tutorial commands are correct. The multi-GPU tutorial command instructs four GPUs to evolve 300 generations each, reading and writing to a common evolve.txt. If not stopped prematurely this will evolve 1200 generations on all hyperparameters.

How do you decide the hyperparameters for yolov5s, yolov5m and yolov5l?
Are they set with the same hyperparameters?

@alicera hyperparameters are evolved on YOLOv5m and then used across all models.

Hello, my command is python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache --evolve
Does that mean I will evolve 10 * 300 times?
Does every epoch evolve 300 times?
Or does it evolve 300 times over these 10 epochs?
Thank you so much.

@python-faker --evolve evolves the base train.py command for 300 generations. The base command is irrelevant.

@glenn-jocher I am a little confused about evolve: is it used for training, or does it just find the hyperparameters by training a few epochs, after which you use the hyp found to retrain?
Thanks.

commented

Hi, I tried to evolve but it produced a memory error:

Traceback (most recent call last):
  File "train.py", line 528, in <module>
    results = train(hyp.copy(), opt, device)
  File "train.py", line 166, in train
    dataloader, dataset = create_dataloader(train_path, imgsz, batch_size, gs, opt,
  File "D:\workspace\yolov5\utils\datasets.py", line 66, in create_dataloader
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\asus\anaconda3\envs\yolov5\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\asus\anaconda3\envs\yolov5\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
    dataloader = InfiniteDataLoader(dataset,
  File "D:\workspace\yolov5\utils\datasets.py", line 84, in __init__
    self.iterator = super().__iter__()
  File "C:\Users\asus\anaconda3\envs\yolov5\lib\site-packages\torch\utils\data\dataloader.py", line 291, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\asus\anaconda3\envs\yolov5\lib\site-packages\torch\utils\data\dataloader.py", line 737, in __init__
    w.start()
  File "C:\Users\asus\anaconda3\envs\yolov5\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\asus\anaconda3\envs\yolov5\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\asus\anaconda3\envs\yolov5\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\asus\anaconda3\envs\yolov5\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\asus\anaconda3\envs\yolov5\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
MemoryError

What could possibly be wrong?

@siyangxie it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments below.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

@glenn-jocher I run the following commands, but the hyp is the same as hyp.scratch, why is that?

for i in 0 1 2 3; do
  nohup python train.py --img 640 \
        --batch 50 \
        --epochs 10 \
        --data ./data/widerface.yaml \
        --cfg ./models/yolov5l.yaml \
        --weights /app/models_saved/face_detect_l.pt \
        --single-cls --cache --evolve --device $i > evolve_gpu_$i.log &
done

The evolve.png is as follows (image attached):
I commented out anchors in train.py as mentioned before; is that the cause?

@Edwardmark Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:

  • Your modified or out-of-date code. If your issue is not reproducible in a new git clone version of this repo we can not debug it. Before going further run this code and verify your issue persists:
$ git clone https://github.com/ultralytics/yolov5 yolov5_new  # clone latest
$ cd yolov5_new
$ python detect.py  # verify detection

# CODE TO REPRODUCE YOUR ISSUE HERE
  • Your custom data. If your issue is not reproducible in one of our 3 common datasets (COCO, COCO128, or VOC) we can not debug it. Visit our Custom Training Tutorial for guidelines on training your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of your labels and images.

  • Your environment. If your issue is not reproducible in one of the verified environments below we can not debug it. If you are running YOLOv5 locally, verify your environment meets all of the requirements.txt dependencies specified below. If in doubt, download Python 3.8.0 from https://www.python.org/, create a new venv, and then install requirements.

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

@glenn-jocher I have gone through at least 8 generation cycles but haven't seen a yolov5/evolve.png file. At what point is this created?

@nanometer34688 evolve.png is generated when evolution completes after 300 generations.

yolov5/train.py

Lines 570 to 571 in 7aeef2d

# Plot results
plot_evolution(yaml_file)

Hi fellow CV enthusiasts,
Are we agreed that --evolve does not support a different img-size than the initial training img-size?
Concretely:
initial training -> img-size 640x640
vs
evolve training -> img-size 1137x640

@MikeHatchi the train.py --img-size argument accepts one scalar value (applied to both train and test sizes), or two scalar values, applied to train and test image sizes, but this is a training argument and not a hyperparameter.

You can evolve on any base scenario you want (including any image size you want).

@glenn-jocher hi and thx

Interesting... I encountered a problem when I wanted to evolve: I chose to --evolve with --img-size 1137 (with and without --rect, as the size of the image is 1137x640), since the pre-trained model had been trained with the initial 640x640.
Basically, it returns that 1137 is an unknown size.

If I take off --img-size 1137, the evolve scenarios are launched.

@MikeHatchi all YOLOv5 operations are constrained to run at multiples of the largest output stride of 32. If your requested img size does not meet the constraint it will be modified accordingly. If you believe there is a reproducible bug anywhere, please raise a bug report with code to reproduce. Thank you.
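A self-contained sketch of that rounding behaviour (this is an illustration of the stride constraint, not the repository's own check_img_size function):

import math

def round_to_stride(img_size: int, stride: int = 32) -> int:
    """Round a requested image size up to the nearest multiple of the model stride."""
    return max(math.ceil(img_size / stride) * stride, stride)

print(round_to_stride(1137))  # 1152 -> the size that would actually be used
print(round_to_stride(640))   # 640  -> already a multiple of 32, unchanged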

@glenn-jocher thx
basic!
I should have been more focused.
I'll raise a bug report if indeed that's still the case.

Hi @Ownmarc hope you're ok?! (or @glenn-jocher)

I've got 2 questions:

  1. doing a scenario with "evolve" will produce a "best.pt" file, right?
    Unfortunately, I've done a 40 epochs "evolve" scenario and after 23 hours, no .pt file has been saved.

  2. if I definitely need a high recall, I can deliberately act here: def fitness(x)?
    ...
    w = [0.0, 0.0, 0.1, 0.9] # weights for [P, R, mAP@0.5, mAP@0.5:0.95]
    ...

Many thanks

Nothing happened and the GPUs were idle while doing hyperparameter evolution with 4 GPUs.

Command used

# Multi-GPU
for i in 0 1 2 3; do
  nohup python train.py --epochs 10 --data coco128.yaml --weights yolov5m.pt --cache --evolve --device $i > evolve_gpu_$i.log &
done

@Samjith888 you have an error in your commands and should view your evolve_gpu_$i.log files.

@Samjith888 you have an error in your commands and should view your evolve_gpu_$i.log files.

Found the following error inside the log file

Error

Traceback (most recent call last):
  File "train.py", line 528, in <module>
    results = train(hyp.copy(), opt, device)
  File "train.py", line 169, in train
    world_size=opt.world_size, workers=opt.workers)
  File "/home/yolov5/utils/datasets.py", line 61, in create_dataloader
    rank=rank)
  File "/home/yolov5/utils/datasets.py", line 378, in __init__
    labels, shapes = zip(*[cache[x] for x in self.img_files])
  File "/home/yolov5/utils/datasets.py", line 378, in <listcomp>
    labels, shapes = zip(*[cache[x] for x in self.img_files])
KeyError: '/home/dataset/train/images/train2014_000000000036.jpg'

commented

What should I do for stage 2, Define Fitness?

@Tony0726 if you have no alternative fitness requirements, you may simply use the default fitness metric, in which case no action is required on your part.

commented

@glenn-jocher Thanks a lot.

I trained with --evolve on multiple GPUs

# Multi-GPU
for i in 0 1 2 3; do
  nohup python train.py --img 640 --batch 32 --workers 8 --multi-scale --epochs 10 --data dataset.yaml --single-cls --weights yolov5s.pt --evolve --cache --device $i 2>&1 | tee evolve_gpu_$i.log &
done

but I didn't get the weights saved; the folder was empty; train/evolve/weights was empty

@satyajitghana evolution does not produce weights, it evolves hyperparameters.


Getting key error while using the same command

@satyajitghana evolution does not produce weights, it evolves hyperparameters.

aaah okay, makes sense. cool.

@Samjith888 you were using a custom dataset; try with the included coco128.yaml.

But it turns out evolution wasn't meant to save weights, so I guess it's okay; I got my hyperparameters though.


I'm using a custom dataset, not coco128.yaml. I already posted the error earlier here, but it disappeared somehow.

Hi All,

I commented out this line: #'anchors': (0, 2.0, 10.0), # anchors per output grid (0 to ignore)

but still getting this error:
Traceback (most recent call last):
  File "train.py", line 561, in <module>
    hyp[k] = max(hyp[k], v[1])  # lower limit
KeyError: 'anchors'

any ideas? fix?
thanx
Frederik

@frederikvanduuren for hyperparameter evolution this line should be uncommented, and set to 0 to ignore, or to a standard anchor count (i.e. 3) to evolve anchor count.

How should I choose the number of epochs? In this tutorial the epoch count is 10, but I have my own custom dataset, and the baseline model needs 75 epochs to converge. Should I set --evolve --epochs 75?

@TommyZihao Hi Tommy, I set the epochs to 100, but you should see how many you need for the mAP & recall to converge to a maximum level.


Would 100 epochs take a super long time?

Hi @glenn-jocher, I used the command 'python train.py --img 640 --batch 16 --epochs 300 --data mydataset.yaml --weights yolov5l.pt --evolve --cache' to try hyperparameter evolution, but after more than 100 epochs my yolov5/runs/train/evolve folder only has the following files (see attached image): no yolov5/runs/evolve/hyp_evolved.yaml, and the weights folder is empty too.

Meanwhile, in the yolov5 root folder there is no yolov5/evolve.txt and no yolov5/evolve.png.
Do you have any idea what the possible cause might be? Thanks!

@wwdok you have not finished training a single generation. < 1 generation will not produce any evolution output.

@frederikvanduuren @glenn-jocher I got it! In my case 300 epochs is 1 generation; by default it needs 1 generation to output yolov5/runs/evolve/hyp_evolved.yaml, and 300 generations (that means 90000 epochs) to output yolov5/evolve.png.

@wwdok yes exactly! For this reason it may make sense to use a different base scenario that produces a fitness faster (i.e. perhaps only train to 100 epochs). But be careful, because you want your base scenario results to correlate strongly with your actual underlying scenario (training 300 epochs), so as you reduce your epoch count to zero the correlation with 300-epoch results will also reduce to zero.

In layman's terms, hyperparameters that help you achieve the best results over short trainings (i.e. in 10 epochs), will not be the same ones that help you achieve the best results at 300 epochs. At 10 epochs things like weight decay don't matter for example, so evolving on short trainings will cause your weight decay to evolve down to zero, which will cause earlier overfitting and worse results at 300 epochs. It's a balancing act each person has to decide on.

@wwdok yes exactly! For this reason it may make sense to use a different base scenario that produces a fitness faster (i.e. perhaps only train to 100 epochs). But be careful, because you want your base scenario results to correlate strongly with your actual underlying scenario (training 300 epochs), so as you reduce your epoch count to zero the correlation with 300-epoch results will also reduce to zero.

@glenn-jocher Can we clarify this a bit please? I read it like this: 10 epochs for evolving will result in 3000 total epochs, generating 300 generations of hyperparameters. Or is it really 90000 as @wwdok stated (which would be ridiculous time-wise)? Further, I understand any img-size parameters will not be taken into account, but could we perhaps evolve with a smaller resolution set to speed it up a bit? If so, how to achieve this other than modifying the dataset?

@thhart it's very simple. Your base scenario is run for n generations. Your base scenario is what you are optimizing; its epoch count is up to you.

@thhart it's very simple. Your base scenario is run for n generations. Your base scenario is what you are optimizing; its epoch count is up to you.

Simple is in the eye of the beholder; however, it is not obvious IMHO how to configure the number of generations being calculated. So when is this number reached when I use --epochs 10 as a parameter?

@thhart I think the number you're looking for is just epochs * generations
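(For example, the tutorial command with --epochs 10 and the default 300 generations works out to 10 Γ— 300 = 3000 training epochs in total; a 300-epoch base scenario, as in @wwdok's case, would be 300 Γ— 300 = 90000.)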

Running --evolve, after the first 10 epochs (first run out of 300 or whatever it is) I get the following error:

anchor_t anchors box cls cls_pw degrees fl_gamma fliplr flipud giou hsv_h hsv_s hsv_v iou_t lr0 lrf mixup momentum mosaic obj obj_pw perspective scale shear translate warmup_bias_lr warmup_epochs warmup_momentum weight_decay
4 3 0.05 0.5 1 0 0 0 0 0.05 0.014 0.68 0.36 0.2 0.01 0.2 0 0.937 1 1 1 0.001 0.5 0 0 0.1 3 0.8 0.0005
Evolved fitness: 0.4112 0.2054 0.2402 0.08847 0.04911 0.02177 0.02316

Traceback (most recent call last):
  File "train.py", line 578, in <module>
    hyp[k] = float(x[i + 7] * v[i])  # mutate
IndexError: index 28 is out of bounds for axis 0 with size 28

Any idea what is wrong?

@LinusJ79 if you believe you have a reproducible bug please raise a full bug report issue using the bug report template with code to reproduce, thank you!

Hi! A few questions regarding evolve @glenn-jocher
1: Why haven't fl_gamma, flipud or iou_t changed in your picture?
2: Why are only 22 of the 28 hyperparameters evolved?
3: Is there a way to lock in certain hyperparameters that I know I want at a certain value?
4: Is mosaic disabled as usual with --rect?

@RobinBram

  1. You can disable evolution for any parameters using the meta dictionary in train.py, or by setting their initial values to zero in your hyp.yaml file.
  2. The displayed results are from an earlier version with fewer hyperparameters; we should update this.
  3. Yes, see #1 above, use the meta dict.
  4. Yes, --rect causes mosaic to be disabled.

@glenn-jocher If we initialise any hyperparameter with 0 and the minimum value is also 0.0, then it doesn't evolve, because of:
hyp[k] = max(hyp[k], v[1])  # lower limit
hyp[k] = min(hyp[k], v[2])  # upper limit
hyp[k] = round(hyp[k], 5)  # significant digits

@abhiagwl4262 yes this is correct. The mutations are gain-based so a zero initial condition will prevent it from mutating.

@glenn-jocher As initial values of the hyperparameters we are using hyp.scratch, which has some parameters initialised to 0, so those parameters are not taking part in mutation. Can you please add a hyp.scratch that sets a better initialisation for the hyperparameters?

@abhiagwl4262 yes I see. You may want to use hyp.finetune.yaml to see if it's a better starting point for evolution.

If you want you can also increase the zero values in hyp.scratch slightly, i.e. to 0.01 or 0.1 to initialize them for evolution.
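A tiny numeric illustration of why a zero value never moves under this multiplicative mutation scheme (the gain value below is made up):

gain = 1.7          # a random mutation gain somewhere in the clipped range [0.3, 3.0]
print(0.0 * gain)   # 0.0   -> a hyperparameter initialized at zero can never evolve
print(0.01 * gain)  # 0.017 -> a small nonzero seed such as 0.01 is free to evolve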

What is the population size for the GA? It seems it only trains 1 model? Is this correct? If so, what is it performing crossover with?

@NMVRodrigues yes that is correct, population size is 1 due to high expense of each member, so we omit crossover and apply mutation to a randomly selected top-5 member from all previous populations. The implementation is here. If you have ideas for improvement please let us know!

yolov5/train.py

Lines 572 to 597 in ed2c742

for _ in range(300):  # generations to evolve
    if Path('evolve.txt').exists():  # if evolve.txt exists: select best hyps and mutate
        # Select parent(s)
        parent = 'single'  # parent selection method: 'single' or 'weighted'
        x = np.loadtxt('evolve.txt', ndmin=2)
        n = min(5, len(x))  # number of previous results to consider
        x = x[np.argsort(-fitness(x))][:n]  # top n mutations
        w = fitness(x) - fitness(x).min()  # weights
        if parent == 'single' or len(x) == 1:
            # x = x[random.randint(0, n - 1)]  # random selection
            x = x[random.choices(range(n), weights=w)[0]]  # weighted selection
        elif parent == 'weighted':
            x = (x * w.reshape(n, 1)).sum(0) / w.sum()  # weighted combination

        # Mutate
        mp, s = 0.8, 0.2  # mutation probability, sigma
        npr = np.random
        npr.seed(int(time.time()))
        g = np.array([x[0] for x in meta.values()])  # gains 0-1
        ng = len(meta)
        v = np.ones(ng)
        while all(v == 1):  # mutate until a change occurs (prevent duplicates)
            v = (g * (npr.random(ng) < mp) * npr.randn(ng) * npr.random() * s + 1).clip(0.3, 3.0)
        for i, k in enumerate(hyp.keys()):  # plt.hist(v.ravel(), 300)
            hyp[k] = float(x[i + 7] * v[i])  # mutate

@glenn-jocher Well in this scenario that would not ensure the exploitation of optimal solutions. There are other mutation operators that could be added, or just replace the standard one, to improve this scenario without any significant complexity/expense increase. I would be glad to try and implement them to help and improve this feature!

@NMVRodrigues yes, this is a challenge as we have a unique problem set that is not quite suited to the off-the-shelf GA methods, mainly due to the very high evaluation cost of a single population member, so the current implementation is the best compromise we found. Feel free to submit PRs with any updates to the evolution code in train.py if you see spots for improvement!

I have a problem when I evolve the hyperparameters: there is no model.pt in the evolve folder, and wandb throws an error. I think it is because the weights weren't saved but were transferred to wandb; in my code it is at line 445.
I changed the nosave option to avoid this error, but then it can't evolve.
If I disable wandb it runs correctly.
Can you fix this? Thanks in advance.
(screenshot attached)

@zzttqu thanks for the info! Evolution will not save any checkpoints (for speed). If you believe you have a reproducible bug, please raise a new issue using the πŸ› Bug Report template, providing screenshots and a minimum reproducible example to help us better understand and diagnose your problem. Thank you! @AyushExel seems like --evolve with wandb may have a problem.

@glenn-jocher okay, I'll try to reproduce this and fix it after our meeting. Can we also include a CI test for evolve to automate this process in the future?

@AyushExel evolve CI is an interesting idea. I'd need to add an optional argument to --evolve in some way because currently evolution is hard-coded to 300 generations.

@glenn-jocher @zzttqu I found the problem. The default behaviour now is to log the final stripped model but the model is not found in case of evolve operation. I'll just push a quick fix

This should fix the problem #2634

@zzttqu evolve with wandb should be fixed now in #2634, please git pull or clone again to receive this update and let us know if you run into any other problems!

@glenn-jocher Hi, sorry for bothering you. I have a question about hyperparameter evolution. I first evolved my own model No.1 (which is based on yolov5s) for 10 generations and trained a new model No.2 based on the new hyperparameters. Then I would like to continue the evolution. According to what you said in another thread, if 'evolve.txt' exists, then I just need to run the same command (which evolves model No.1). However, when I ran the command, I found that the hyperparameters displayed at the start do not match the hyperparameters in evolve.txt (the ones I used to train model No.2). I just wonder, is this correct? Sorry if it is trivial, I am new to this area. Thank you for your time and help in advance!

@psyjw they should not match, you are starting a new generation.

@glenn-jocher Many thanks! I thought it would display the last generation's hyperparameters. By the way, could you tell me whether there is any difference if I continue training evolution on model No.1 or train evolution on model No.2? If I train on model No.2, does it still read the values from 'evolve.txt'?

Evolution scenario is entirely up to you. Evolve.txt is the sole source used.

@youyuxiansen this is a probabilistic mutation equation bounded at upper and lower limits, prototyped using empirical results. There is no documentation other than the actual equation. To understand it better you can simply use it to generate a population of values and visualize the histogram of the population.

@youyuxiansen this is a probabilistic mutation equation bounded at upper and lower limits, prototyped using empirical results. There is no documentation other than the actual equation. To understand it better you can simply use it to generate a population of values and visualize the histogram of the population.

@glenn-jocher Thank you for such a timely response! I still have some questions. Can you explain the meaning of "mutation probability" and "sigma"? Why were they chosen to be 0.8 and 0.2? And why is v clipped to (0.3, 3.0)? I guess you must have made some tries on it, right? I'm interested in learning about the inspiration! I would be grateful if you could talk about the details. Thanks!

@youyuxiansen all parameters above are based upon empirical results of YOLOv5 evolution experimentation with COCO

@youyuxiansen all parameters above are based upon empirical results of YOLOv5 evolution experimentation with COCO

Got it! Thanks.

I'm trying to run --evolve on 2 GPUs but the process is stuck at nohup: redirecting stderr to stdout

This is the shell script that I ran:

#!/bin/bash
for i in 1 2; do
  nohup python train.py --img 640 --batch 16 --epochs 100 --data QMUL.yaml --weights yolov5m.pt --cache --evolve --device $i > evolve_gpu_$i.log &
done

Am I doing something wrong?

@OrjwanZaafarani you can always try the command without the nohup if it's causing problems, or you can redirect to /dev/null:

In ipython console:

# YOLOv5m6 evolve on COCO
for i in [0, 1, 2, 3, 4, 5, 6, 7]:
  !python train.py --batch 32 --weights '' --cfg yolov5m6.yaml --data coco.yaml --epochs 300 --img 640 --hyp hyp.scratch-p6-evolve.yaml --evolve --device {i} > /dev/null 2>&1 &