stephanecharette / DarkHelp

C++ wrapper library for Darknet

Home Page: https://www.ccoderun.ca/darkhelp/api/Summary.html


Confused about resize aspect ratio training/inferencing

erikstauber opened this issue · comments

During training, using DarkMark/darknet with CUDA on my Linux box, I deselected 'resize images' because I didn't want the network to change the aspect ratio of the images; many are smaller than my network dimensions.

Typical small training images: 96 x 32
Network dimensions: 192 x 96 x 1
Network: YoloV4-Tiny
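For context, those network dimensions map onto the [net] section of the Darknet .cfg file roughly like this (a sketch; a real config contains many more settings):

```
[net]
# 192 wide, 96 tall, 1 channel (greyscale)
width=192
height=96
channels=1
```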

As the training progresses, DarkMark seems to properly utilize darknet to predict annotations on unmarked images. So all is good up to this point.

When I run inference on my Windows box (using the DarkHelp API directly with OpenCV/CUDA), I notice that the NN performs very poorly. Digging in, I see that the function DarkHelp::NN::predict_internal_opencv() calls the function fast_resize_ignore_aspect_ratio. If I change that to resize_keeping_aspect_ratio, then the network performs properly. Shouldn't the prediction function maintain the aspect ratio?
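For concreteness, the keep-aspect-ratio behaviour can be sketched as a hypothetical letterbox helper (illustration only, not DarkHelp's actual implementation) that scales the image without distortion and then pads it out to the exact network dimensions:

```cpp
#include <algorithm>
#include <opencv2/opencv.hpp>

// Hypothetical letterbox helper (not DarkHelp code): scale the image to
// fit inside the network dimensions without changing the aspect ratio,
// then pad the remainder so the output size matches exactly.
cv::Mat letterbox(const cv::Mat & src, const cv::Size network_dims)
{
	const double scale = std::min(
			network_dims.width  / static_cast<double>(src.cols),
			network_dims.height / static_cast<double>(src.rows));

	cv::Mat resized;
	cv::resize(src, resized, cv::Size(), scale, scale, cv::INTER_AREA);

	// centre the resized image on a mid-grey canvas of the network size
	cv::Mat dst(network_dims, src.type(), cv::Scalar::all(127));
	resized.copyTo(dst(cv::Rect(
			(network_dims.width  - resized.cols) / 2,
			(network_dims.height - resized.rows) / 2,
			resized.cols, resized.rows)));
	return dst;
}
```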


You don't have a choice: at some point, someone or something needs to resize your image, because Darknet only knows about the network dimensions. By the time Darknet processes the image, it must match the network dimensions. You can either be explicit about resizing it, or you can let Darknet resize it for you.

I don't understand how changing that line would make any difference. Even if you resize the image keeping the aspect ratio, by the time Darknet sees it, it will still be the wrong size, and Darknet will then resize it again without keeping the aspect ratio. There should be no difference, other than the fact that you're resizing twice and taking up more time.
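To put concrete numbers on that double-resize (a sketch of the reasoning using the dimensions from this issue, and assuming the aspect-ratio-preserving resize fits the image inside the network size without padding):

```cpp
#include <algorithm>
#include <iostream>

int main()
{
	const double img_w = 96.0, img_h = 32.0;  // original image
	const double net_w = 192.0, net_h = 96.0; // network dimensions

	// 1st resize, keeping the aspect ratio: fit 96x32 inside 192x96
	const double scale = std::min(net_w / img_w, net_h / img_h); // 2.0
	std::cout << "after aspect-ratio resize: "
	          << img_w * scale << "x" << img_h * scale << std::endl; // 192x64

	// 2nd resize, inside Darknet: it needs exactly 192x96, so the 192x64
	// image gets stretched vertically by 1.5x, distorting it anyway
	std::cout << "after Darknet's resize: " << net_w << "x" << net_h << std::endl;
	return 0;
}
```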

When you train, at the very least you definitely should choose the option to resize the images. Depending on what you're doing, you may also want to consider the random crop-and-zoom option, which for many (most?) networks helps greatly. See here for a description of what that does: https://www.ccoderun.ca/darkmark/darknet_images.html#crop_and_zoom_images

The fact that your images are so incredibly tiny (32 pixels in height!?) may expose a resizing issue that no one has seen before. I hope you didn't crop your images? https://www.ccoderun.ca/programming/darknet_faq/#crop_training_images If so, you're definitely going about this the wrong way. This is how annotations are supposed to work: https://www.ccoderun.ca/programming/darknet_faq/#image_markup

...and now that I think about it, this doesn't make any sense:

Typical small training images: 96 x 32
Network dimensions: 192 x 96 x 1

If your images are 96x32 then why is your network 192x96? Your network should never be larger than your images!

@stephanecharette

Your YouTube tutorials on license plate and street sign text recognition using YOLO were the inspiration for how I could solve a problem where I need to do relatively simple OCR on streaming input images (as an alternative to Tesseract, which I continually struggle to get working properly with certain minus signs). I only need to detect numbers, decimal points, and minus signs, and I know the exact regions where I need to read those numbers, hence the very small network. Some of the training images were just tiny little clips I put in, but after more carefully reading through your crop-and-zoom links, I understand your point that the images should most definitely be bigger than the network. Many thanks on that one.

As an aside, there has been quite a bit of discussion regarding maintaining aspect ratios when fitting images to network dimensions. Apparently the original version of YOLO retained the aspect ratio by letterboxing, while the current v4 version does not, but keeps the option to do so (yolov4-csp.cfg sets letter_box=1 for this). There is lots of detailed discussion in this issue: AlexeyAB/darknet#232 (comment)
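For reference, a sketch of what enabling that option looks like in the [net] section of a Darknet .cfg file (the surrounding values here are illustrative, not copied from any particular config):

```
[net]
width=512
height=512
channels=3
# pad (letterbox) instead of stretching when resizing to the network dimensions
letter_box=1
```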

In any case, even my initial attempt works great so thanks for the guidance!