frank-xwang / RIDE-LongTailRecognition

[ICLR 2021 Spotlight] Code release for "Long-tailed Recognition by Routing Diverse Distribution-Aware Experts."

[Conceptual question] Training Stage 1 only is better than full training when # experts is 4

jd730 opened this issue · comments

Hi, thank you for sharing your code. I have a question about the performance. From my understanding, several components boost performance, such as the diversity loss and the router.

I actually assumed that the diversity loss is the main contributor to the performance gain, via the ensemble effect. So I tested how much the router contributes to performance with four experts (in Table 4, the comparison is conducted with two experts).
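
For reference, here is a rough sketch of what I mean by the ensemble and the diversity regularizer (my own illustration for this question, not the exact loss in this repository; names like `ensemble_with_diversity`, `lambda_div`, and `T` are hypothetical):

```python
import torch
import torch.nn.functional as F

def ensemble_with_diversity(expert_logits, targets, lambda_div=0.2, T=3.0):
    """expert_logits: list of [B, C] tensors, one per expert."""
    # Ensemble prediction: average the experts' logits.
    mean_logits = torch.stack(expert_logits).mean(dim=0)

    # Per-expert classification loss.
    ce = sum(F.cross_entropy(l, targets) for l in expert_logits) / len(expert_logits)

    # Diversity term: push each expert's (softened) prediction away from the
    # ensemble mean by maximizing the KL divergence, hence the minus sign below.
    kl = sum(
        F.kl_div(F.log_softmax(l / T, dim=1),
                 F.softmax(mean_logits.detach() / T, dim=1),
                 reduction="batchmean")
        for l in expert_logits
    ) / len(expert_logits)

    return ce - lambda_div * kl
```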

When I ran `python train.py -c "configs/config_imbalance_cifar100_ride.json" --reduce_dimension 1 --num_experts 4`, the test performance was 49.58.

{'loss': 2.396228114891052, 'accuracy': 0.4958, 'many_class_num': 35, 'medium_class_num': 35, 'few_class_num': 30, 'many_shot_acc': 0.68428576, 'medium_shot_acc': 0.51285714, 'few_shot_acc': 0.256}

On the other hand, when I also trained the EA module with `python train.py -c "configs/config_imbalance_cifar100_ride_ea.json" -r saved/models/Imbalance_CIFAR100_LT_RIDE/0110_143024/model_best.pth --reduce_dimension 1 --num_experts 4`, the test performance was just 49.1, which is the same as the score reported in Table 4 (for the version with distillation).

{'loss': 2.6827446435928346, 'accuracy': 0.4914, 'top_k_acc': 0.7724, 'many_class_num': 35, 'medium_class_num': 35, 'few_class_num': 30, 'many_shot_acc': 0.6851429, 'medium_shot_acc': 0.5065715, 'few_shot_acc': 0.24766666}

Is the purpose of the EA module only to reduce computation (i.e., GFLOPs)?

Thank you so much.

Hi, the expert assignment module is proposed to dynamically route cascaded experts: it assigns more ambiguous (or harder) instances to additional experts, thereby reducing the overall computational cost. In our local experiments, the EA module should only slightly reduce the top-1 accuracy (within 0.2%). You can tune the hyper-parameter "pos_weight" if you find that the EA module consumes too much compute or uses too few experts.
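
To make the routing idea concrete, here is a minimal sketch of cascaded expert assignment at inference time (an illustration only; `experts`, `routers`, and `threshold` are hypothetical names, not the exact modules in this codebase, and the "pos_weight" hyper-parameter affects how the routers are trained rather than this inference loop):

```python
import torch

@torch.no_grad()
def cascaded_inference(x, experts, routers, threshold=0.5):
    """experts: list of per-expert heads; routers: list of small binary gates."""
    # Every sample pays for the first expert.
    logits_sum = experts[0](x)
    active = torch.ones(x.size(0), dtype=torch.bool, device=x.device)

    for expert, router in zip(experts[1:], routers):
        # The router looks at the running prediction and estimates whether a
        # sample is still ambiguous and needs another expert.
        need_more = torch.sigmoid(router(logits_sum)).squeeze(-1) > threshold
        active = active & need_more
        if not active.any():
            break
        # Only the still-ambiguous samples are forwarded to the next expert,
        # which is where the GFLOP savings come from.
        logits_sum[active] = logits_sum[active] + expert(x[active])

    return logits_sum
```

Easy samples exit after the first expert or two, while hard samples accumulate predictions from more experts, so the average compute drops with only a small accuracy cost.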

Regarding the performance: we found that the CIFAR100-LT results reproduced with this codebase are generally higher (by about 0.4%-0.6%) than what was reported in our paper. Therefore, "RIDE without distillation" reproduced with this codebase may reach performance similar to the reported numbers for "RIDE with distillation loss".

Please let us know if you have any other questions.