frank-xwang / RIDE-LongTailRecognition

[ICLR 2021 Spotlight] Code release for "Long-tailed Recognition by Routing Diverse Distribution-Aware Experts."

The diversity loss has no effect?

yypurpose opened this issue · comments

commented

Hi, authors! Thank you for doing such an inspiring job and for open-sourcing the code! I ran into a problem when using your code: the diversity loss does not seem to have much effect.

I ran your "RIDE Without Distill (Stage 1)" setting with 3 experts on CIFAR100-LT using your config and got 47.8% validation accuracy. I then tried an ablation: setting "additional_diversity_factor" to 0.45 (the original setting is -0.45) gave 48.0% validation accuracy, which is even 0.2% higher than 47.8%. I did not change anything else in your code. Could you help me figure out the problem?

Thanks a lot!

commented

Update:
I also found a bug in the code:

ride_loss_logits = output_logits if self.additional_diversity_factor == 0 else logits_item
This means the "collaborative loss" is used whenever the diversity loss is disabled (if I did not misunderstand), which makes the comparison unfair (as the paper says, the "individual loss" is better than the "collaborative loss"). So I changed this so that the "individual loss" is used when the diversity loss is off, and achieved 48.7% accuracy at test time.
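For clarity, here is a minimal sketch of the change (variable names follow the line quoted above; the surrounding RIDELoss code is omitted, so treat this as an illustration rather than a patch):

# Original: fall back to the aggregated (collaborative) logits when the
# diversity term is disabled.
ride_loss_logits = output_logits if self.additional_diversity_factor == 0 else logits_item

# Changed: always supervise each expert on its own logits, so the ablation
# without the diversity term still uses the individual loss and the
# comparison stays fair.
ride_loss_logits = logits_item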

Here are some logs of my experiments:

  1. individual loss w/ -0.45 KL divergence
    'loss': 2.590021451187134, 'accuracy': 0.4772, 'many_class_num': 35, 'medium_class_num': 35, 'few_class_num': 30, 'many_shot_acc': 0.6554285, 'medium_shot_acc': 0.50257146, 'few_shot_acc': 0.23966669
  2. individual loss w/ 0.45 KL divergence
    'loss': 2.487060943412781, 'accuracy': 0.4798, 'many_class_num': 35, 'medium_class_num': 35, 'few_class_num': 30, 'many_shot_acc': 0.68057144, 'medium_shot_acc': 0.49142852, 'few_shot_acc': 0.23200001
  3. collaborative loss w/o KL divergence
    'loss': 3.084108729171753, 'accuracy': 0.4549, 'many_class_num': 35, 'medium_class_num': 35, 'few_class_num': 30, 'many_shot_acc': 0.64285713, 'medium_shot_acc': 0.46028566, 'few_shot_acc': 0.22933334
  4. individual loss w/o KL divergence
    'loss': 2.51549779586792, 'accuracy': 0.4872, 'many_class_num': 35, 'medium_class_num': 35, 'few_class_num': 30, 'many_shot_acc': 0.68, 'medium_shot_acc': 0.49628574, 'few_shot_acc': 0.25166667

All experiments use 3 experts, and individual loss w/o KL divergence performs the best.

Thanks.

Hi yypurpose, thanks for your insightful question and for your attention to our project. I have read the numbers you give and your analysis carefully. It is very interesting for us to see that your runs give different results than ours, so let me briefly explain how we ran our experiments, what the results were, and how we reached the current conclusion. We ran your experiments five times in total: two runs with -0.45 (the current hyperparameter), two runs with 0.45 (the value you suggest), and one run with your settings, the hyperparameter set to 0, and the code modified in the way you describe.

With hyperparam=-0.45, we get 48.82% and 49.00% as the first pair. With hyperparam=0.45, we get 47.74% and 47.75% as the second pair. For the experiment you suggest with hyperparam=0, we get 48.25%. The mean of our runs with hyperparam=-0.45 is about 1.2% higher than the runs with hyperparam=0.45, which is enough to show that our loss is effective in our runs.

Regarding your results: your accuracy is lower than both the formally reported number (48.0%, obtained with an old codebase) and the runs above on the new codebase (48.82% and 49.00%, about 48.9% on average, which is itself higher than the old codebase used in the report), so something may indeed be mismatched. Many factors could lead to such differences. I suggest you try the latest codebase and upgrade your dependencies to make sure we are on the same page, although I admit this does not solve every problem and it is hard to give much more assistance remotely.

To let you compare with our previous experiments, I have put the logs of our runs in the following gists, which you can check to validate your environment and process: https://gist.github.com/TonyLianLong/885c8d3d0feb9f775c3a2e5794ba0450, https://gist.github.com/TonyLianLong/9656ca2ce75eb5f167ba3349c69b518e, https://gist.github.com/TonyLianLong/87ef86db3195b243846f8ca024441e31. If you would like, I can also send you the checkpoints so that you can check whether our models evaluate to the same results on your side, which would at least rule out precision issues at inference time. We also plan several updates to the report and codebase (see below), and we will re-check every part when we release them, so if you think this deserves further investigation, don't worry: it is already on our plan. We will keep you posted on anything we discover. Hope this answers your question.

In addition, I would like to share some information on the upcoming revision of our report. This also serves as a preview for anyone who is curious (so if you are one of those people, you may want to read this part). We acknowledge that several things in the current version are not clear enough, including the analysis, presentation, organization, and formulas, which may be misleading, and we will post an update in a few weeks. To spare you the wait for the full revised report, let me outline how we organize our contributions; I will go into some detail because it also answers other potential questions for clarification and may address other readers' concerns as well.

We organize our contributions as follows:

  1. The multi-expert framework, which is applicable to many backbone networks, allows parallel voting and agreement among experts, and is designed to use parameters efficiently.
  2. The loss function that encourages diversity, which consists of two components serving the same goal:
    2.1. The individual loss, which we propose as a replacement for the collaborative loss used in earlier multi-branch models such as BBN. Because the loss is applied to each expert individually, it decouples most of the correlation among experts and contributes about 2% of our improvement, without adding computation or changing the evaluation/inference formula at test time.
    2.2. An explicit KL-divergence term. Although random initialization alone already encourages some diversity implicitly, we find that adding this explicit term to further encourage disentanglement is beneficial most of the time, even if the gain is small (as listed above) compared with the 5% to 7% overall improvement. We admit that we did not offer enough details here, and we will add more explanation of the effects of this loss.
  3. The expert-routing module, which exploits the multi-expert property to reduce the computation needed to run the model: diversity matters when we use it to reduce variance, which is especially useful for tail classes and lets us preserve accuracy at a lower computational cost.

We are also going to add more experiments and analysis, along with corrections to several known issues. In addition, we present tricks specific to multi-expert models, such as distilling a model with 6 experts as a teacher into a model with fewer experts, as well as analyses such as the bias-variance tradeoff on long-tailed datasets. We will update the codebase accordingly to reflect these changes. I think the updated report will give you a clear view of our project and clarify a lot of things.
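To make point 2 above more concrete, here is a minimal, simplified sketch of the idea. This is not the exact RIDELoss implementation: plain cross-entropy stands in for the per-expert classification loss, and the temperature value and the exact direction of the KL term are simplifying assumptions.

import torch
import torch.nn.functional as F

def sketch_ride_loss(expert_logits, target, diversity_factor=-0.45, temperature=3.0):
    # expert_logits: a list of [batch, num_classes] tensors, one per expert.
    # Ensemble distribution used as the reference for the diversity term.
    mean_prob = torch.stack(
        [F.softmax(logits / temperature, dim=1) for logits in expert_logits]
    ).mean(dim=0)

    total = 0.0
    for logits in expert_logits:
        # Individual loss: each expert is supervised on its own logits,
        # which decouples the experts (unlike a collaborative/summed loss).
        total = total + F.cross_entropy(logits, target)

        # Diversity term: KL divergence between the ensemble mean and the
        # expert's temperature-scaled distribution. With a negative
        # diversity_factor, minimizing the total loss maximizes this KL,
        # i.e. pushes the experts apart.
        log_p = F.log_softmax(logits / temperature, dim=1)
        kl = F.kl_div(log_p, mean_prob, reduction="batchmean")
        total = total + diversity_factor * temperature * temperature * kl

    return total

Setting diversity_factor to 0 in this sketch recovers the plain individual loss, which corresponds to the ablation discussed in this issue.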

commented

Thank you for your detailed answer! My view is that if the performance is still good without the diversity loss, that is a good thing in itself; no need to worry too much about the hyper-parameters!

I'm also looking forward to seeing the new revision of your report!

Thanks.

@yypurpose @TonyLianLong My ablation experiments are consistent with @yypurpose's.
On CIFAR100-LT, with all other factors the same, the test-set accuracies are:

  1. individual loss + KL loss (without any change to the code): 48.51
  2. collaborative loss (set additional_diversity_factor=0): 46.39
  3. individual loss only (comment out the line that adds the diversity loss): 48.95

It seems the diversity loss has no effect, although making multiple experts focus on different things sounds reasonable.

Although this issue has been addressed above, we have checked the checkpoints of all the results above and attached them here, including the results of several runs, the log files, and the checkpoints. According to them, our results are consistent with the results reported in the paper, and are in fact often higher than the reported numbers.

Here we provide a summary of the previous experiment results:

  1. Diversity loss (individual+KL) with weight_factor=-0.45:
  2. Diversity loss (individual+KL) with weight_factor=0.45:
  3. Diversity loss (individual only):
  4. Vanilla (collaborative):

You can test our models with the checkpoints above. The results we obtained strengthen our belief that the diversity loss is effective. Again, I suggest you try the latest codebase and upgrade your dependencies to make sure we are on the same page, although I admit this does not solve every problem and it is hard to give much more assistance remotely.

In addition, the intuition is that diversity is sensitive to many factors. As a thought experiment, if the experts' initializations are highly correlated (in the extreme case, identical), then the experts will receive similar or even identical SGD updates and produce the same outputs (since we train on the same data), and no ensemble improvement will be observed unless we apply a loss that encourages diversity.
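As a toy illustration of this thought experiment (this is not from our codebase; the module and shapes are arbitrary): two experts that start from identical weights and see the same batch receive identical gradients, so without an explicit diversity term nothing ever separates them.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
expert_a = nn.Linear(16, 4)
expert_b = nn.Linear(16, 4)
expert_b.load_state_dict(expert_a.state_dict())  # force identical initialization

x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))

# Individual cross-entropy on the same data for both experts.
loss = F.cross_entropy(expert_a(x), y) + F.cross_entropy(expert_b(x), y)
loss.backward()

# The gradients (and hence every SGD step) are identical, so the experts
# would stay identical forever unless a diversity term breaks the symmetry.
print(torch.allclose(expert_a.weight.grad, expert_b.weight.grad))  # True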

Our loss, by encouraging a large KL divergence between experts, promotes this diversity and removes that adverse effect, since we are not currently controlling these factors by any other means.

For the diversity part, in order to encourage the experts to make complementary decisions, we also adopt the "individual loss" instead of a collaborative loss as in BBN. The decoupling effect of the individual loss also encourages the experts to produce diversified opinions, and the experiments above show the effectiveness of the diversity loss with both parts.

As you can see, the models above were trained half a year ago, and I don't have much time to dig deeper into this issue right now. We are planning an extension of this project, and we will definitely perform an in-depth analysis to address your concern there.

Hope this answers your question and wish you a nice day.