albumentations-team / autoalbument

AutoML for image augmentation. AutoAlbument uses the Faster AutoAugment algorithm to find optimal augmentation policies. Documentation - https://albumentations.ai/docs/autoalbument/

Home Page: https://albumentations.ai/docs/autoalbument/


DistributedDataParallel ?

learningyan opened this issue · comments

Hi! Does this codebase support the DistributedDataParallel now?
Besides, when I try to search on my dataset, the loss keeps increasing. I used the same config format as the provided examples — what's wrong?

Hey, @learningyan

AutoAlbument doesn't support DistributedDataParallel for now, but it is in my roadmap, and I plan to add it in the next few months.

As for the loss: I am currently creating a benchmark for AutoAlbument on multiple classification and segmentation datasets. When this benchmark is finished, I can share more intuition behind the loss values and their meaning. For now, here is my experience with the losses, based on running AutoAlbument on multiple datasets:

  • a_loss is the loss for the policy network (the Generator, in GAN terms), the network that applies augmentations to input images. d_loss is the loss for the Discriminator, the network that tries to guess whether an input image is augmented or non-augmented. loss is a task-specific loss (`CrossEntropyLoss` for classification, `BCEWithLogitsLoss` for semantic segmentation) that acts as a regularizer and prevents the policy network from applying augmentations that make an object of class A look like an object of class B.
  • During each iteration, a_loss and d_loss can increase or decrease, and that's ok. The only problematic case is when a_loss always increases and never decreases after each batch, while d_loss always decreases and never increases after each batch. That means the Discriminator is only getting better and better at each step, and the Policy Network cannot produce augmented images that fool the Discriminator.
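The monotonic-loss failure mode in the second bullet can be checked programmatically on recorded per-batch losses. A minimal pure-Python sketch — the helper name and the heuristic itself are mine, not part of AutoAlbument:

```python
def adversarial_training_is_degenerate(a_losses, d_losses):
    """Detect the failure mode described above: the policy-network loss
    (a_loss) rises after every batch while the discriminator loss (d_loss)
    falls after every batch, i.e. the Discriminator always wins.

    a_losses, d_losses: per-batch loss values recorded during an epoch.
    Returns True when training looks degenerate.
    """
    if len(a_losses) < 2 or len(d_losses) < 2:
        return False  # not enough history to judge
    a_always_up = all(nxt > cur for cur, nxt in zip(a_losses, a_losses[1:]))
    d_always_down = all(nxt < cur for cur, nxt in zip(d_losses, d_losses[1:]))
    return a_always_up and d_always_down
```

Ordinary oscillation in either loss (the healthy case described above) makes the check return False; only a strictly monotone pair of curves trips it.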

@creafz thanks a lot for this writeup. I'd love to hear more intuition behind the losses and how to assess the quality of an AutoAlbument run, e.g. to understand at least initially whether the training was successful. In some of my initial experiments d_loss is pretty stable, although the value range is massive (e.g. between e-8 and e+8). Meanwhile, a_loss always decreases or always increases, going into e+9 / e-9 values a few epochs into the training.

Hey, @jwitos

Now I am finishing AutoAlbument experiments with datasets such as CIFAR10, ImageNet, and Pascal VOC. I am planning to add a description of those experiments and loss values to the documentation.

Briefly speaking, I think the only representative metric for the quality of AutoAlbument training is "Average Parameter change" (that is, how much the augmentation parameters changed at the end of the epoch compared to the beginning of the epoch). This metric should decrease and then plateau at some value. But I think this metric is heavily dependent on the size of the dataset, and if the dataset is small, it can be very noisy.
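The metric described above — how much the augmentation parameters moved over one epoch — can be sketched as a mean absolute difference. A hypothetical helper, not AutoAlbument's actual implementation:

```python
def average_parameter_change(params_start, params_end):
    """Sketch of the "Average Parameter change" idea: the mean absolute
    difference between the augmentation policy parameters captured at the
    start of an epoch and the same parameters at the end of that epoch.
    """
    if len(params_start) != len(params_end):
        raise ValueError("parameter snapshots must have the same length")
    total = sum(abs(end - start) for start, end in zip(params_start, params_end))
    return total / len(params_start)
```

Tracking this value per epoch gives the decrease-then-plateau curve described above; a noisy curve on a small dataset would show up as large epoch-to-epoch jumps in the returned value.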

Here are, for example, the TensorBoard logs for one of my CIFAR10 experiments - https://tensorboard.dev/experiment/hpqoQQEATAy9XhpDbvKSKA/#scalars&_smoothingWeight=0. "Average Parameter change" is decreasing at the end of the training, while a_loss and d_loss are increasing.

@jwitos I have added TensorBoard logs for AutoAlbument configs from the examples directory. Hope that helps - https://albumentations.ai/docs/autoalbument/metrics/

A few pieces of advice I can give:

  • As a baseline for the number of epochs to train AutoAlbument, take the number of epochs you use to train your base model and divide it by 10. So if you train your classification model for 200 epochs, use 20 epochs for the AutoAlbument augmentation policy search.
  • To choose the AutoAlbument epoch that produces the best policy, monitor "Average Parameter change". It should decrease, then plateau and start to oscillate. Take the policy from the epoch at which "Average Parameter change" plateaued. If there is no clear pattern in "Average Parameter change", just take the policy from the last epoch.
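The epoch-selection rule in the second bullet can be turned into a small helper: scan the per-epoch "Average Parameter change" values for the first epoch where the metric stops dropping, and fall back to the last epoch when no plateau is found. The function name and the `tol` threshold are my own illustrative choices, not AutoAlbument settings:

```python
def pick_policy_epoch(param_changes, tol=0.05):
    """Pick the epoch whose policy to use, following the advice above.

    param_changes: per-epoch "Average Parameter change" values.
    tol: hypothetical relative-drop threshold below which the metric is
         considered to have plateaued.
    Returns a zero-based epoch index.
    """
    for i in range(1, len(param_changes)):
        prev, cur = param_changes[i - 1], param_changes[i]
        # plateau: the metric no longer drops by more than `tol` relative
        # to the previous epoch
        if prev > 0 and (prev - cur) / prev <= tol:
            return i
    # no clear pattern: take the policy from the last epoch
    return len(param_changes) - 1
```

For a curve like `[1.0, 0.5, 0.3, 0.29, 0.3]` the drop from epoch 2 to epoch 3 is only ~3%, so the helper would select epoch 3.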

@creafz: If I want to extend the base code for multi-GPU processing, where should I start? Also, could you re-upload the TensorBoard logs for the CIFAR10, ImageNet, and Pascal VOC experiments, since the TensorBoard.dev service has been shut down? Many thanks.