ziplab / SN-Net

[CVPR 2023 Highlight] This is the official implementation of "Stitchable Neural Networks".

Home Page: https://snnet.github.io

Motivation for sn-net

wujiafu007 opened this issue

The point of knowledge distillation is that a small model can achieve performance comparable to a large model. What, then, is the significance of your model stitching? I can't follow the logic of this direction.

Good question! @wujiafu007

SN-Net aims to obtain many networks by efficiently stitching off-the-shelf pretrained model families, such that a single model can satisfy different accuracy-speed trade-offs at runtime without any additional training cost. This is fundamentally different from KD, which aims to distill the knowledge into a single smaller model and train it from scratch.
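To make the idea concrete, here is a minimal, self-contained sketch of what stitching means (my own illustration, not the official code in this repo): the early blocks of a small anchor are connected to the late blocks of a larger anchor through a learned linear stitching layer. The transformer blocks below are untrained stand-ins for pretrained anchors such as DeiT-Ti/S, and the widths, depth and split point are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_anchor(depth, dim, heads):
    # Stand-in for a pretrained transformer anchor (e.g. DeiT-Ti or DeiT-S).
    return nn.ModuleList([
        nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        for _ in range(depth)
    ])

class StitchedNet(nn.Module):
    def __init__(self, small_dim=192, large_dim=384, depth=12, split=6, num_classes=1000):
        super().__init__()
        self.small_blocks = make_anchor(depth, small_dim, heads=3)  # "front" anchor
        self.large_blocks = make_anchor(depth, large_dim, heads=6)  # "back" anchor
        self.split = split
        # The stitching layer: a simple learned projection between the two widths.
        self.stitch = nn.Linear(small_dim, large_dim)
        self.head = nn.Linear(large_dim, num_classes)

    def forward(self, tokens):
        x = tokens
        for blk in self.small_blocks[:self.split]:   # early blocks from the small anchor
            x = blk(x)
        x = self.stitch(x)                           # hop into the large anchor's feature space
        for blk in self.large_blocks[self.split:]:   # late blocks from the large anchor
            x = blk(x)
        return self.head(x.mean(dim=1))              # pooled classification head

net = StitchedNet()
logits = net(torch.randn(2, 197, 192))  # token sequence at the small anchor's width
print(logits.shape)                     # torch.Size([2, 1000])
```

Choosing different split points (and different anchor pairs) gives a whole family of stitches that covers the accuracy-speed spectrum between the anchors.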

Hope this helps!

@HubHop What I mean is that this doesn't provide substantial help for practical industrial deployment. For example, if you assemble a Swin-T-sized model, its performance will only be at the tiny level. With knowledge distillation, however, it is possible to achieve performance close to that of a larger model (referred to as M) even at the tiny scale. Furthermore, this approach is not a zero-cost combination: as reported in the paper, it still requires 50 epochs of training. Taking the standard 300 epochs of ImageNet-1K training as an example, training the baseline model for an additional 50 epochs would also yield some improvement.

To be clear, and with all due respect, this is interesting work; however, I struggle to identify its practical application scenarios.

I believe there is a misunderstanding. SN-Net does not aim to improve the performance of a single neural network like KD does; instead, it connects the different accuracy-speed trade-offs within a pretrained model family.

Here are some notes.

  1. We don't assemble a Swin-T-sized model. We assemble the entire model family, which includes Swin-T/S/B. Moreover, each anchor can be pretrained with KD to further improve its performance, e.g., KD-improved Swin-T/S/B. Remember that the foundation of SN-Net is to efficiently utilise the large number of pretrained models in the model zoo, which is orthogonal to the pretraining stage.
  2. With KD, we may obtain a single small model that achieves almost comparable performance to a large model. But we can also apply KD to the large model so that it achieves better performance as well. Essentially, there exists a performance-complexity gap between models of different sizes.
  3. Models of different sizes have different application scenarios and advantages. A small model is energy-efficient but may not handle complicated tasks very well. In contrast, a large model usually performs better but suffers from huge energy consumption. SN-Net provides an efficient way to obtain these differently sized models within a single network. This can be particularly useful for autonomous driving or other energy-sensitive applications.
  4. The 50 epochs of training are not meant to improve the performance of the trained anchors, e.g. DeiT-Ti/S/B. They form a joint-training process for SN-Net that interpolates the performance-complexity curves between different anchors, such that we can obtain numerous networks after training (see the rough training sketch after this list).
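Regarding point 4, here is a rough sketch of how such a joint training can look (my own illustration under simplifying assumptions, not the released training code): each candidate split point defines one stitched sub-network with its own stitching layer, and each iteration samples one stitch at random and updates it, which is how the curve between the anchors gets interpolated. The MLP blocks, split points, and random data below are purely illustrative.

```python
import random
import torch
import torch.nn as nn

DIM_SMALL, DIM_LARGE, DEPTH = 64, 128, 8
# Untrained stand-ins for two pretrained anchors of different widths.
small = nn.ModuleList([nn.Sequential(nn.Linear(DIM_SMALL, DIM_SMALL), nn.GELU()) for _ in range(DEPTH)])
large = nn.ModuleList([nn.Sequential(nn.Linear(DIM_LARGE, DIM_LARGE), nn.GELU()) for _ in range(DEPTH)])
splits = [2, 4, 6]                                            # candidate stitching positions
stitches = nn.ModuleDict({str(s): nn.Linear(DIM_SMALL, DIM_LARGE) for s in splits})
head = nn.Linear(DIM_LARGE, 10)

params = (list(small.parameters()) + list(large.parameters())
          + list(stitches.parameters()) + list(head.parameters()))
opt = torch.optim.AdamW(params, lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def forward_stitch(x, split):
    for blk in small[:split]:                                 # early blocks, small anchor
        x = blk(x)
    x = stitches[str(split)](x)                               # one stitching layer per split point
    for blk in large[split:]:                                 # late blocks, large anchor
        x = blk(x)
    return head(x)

for step in range(100):                                       # toy loop with random data
    x, y = torch.randn(32, DIM_SMALL), torch.randint(0, 10, (32,))
    split = random.choice(splits)                             # sample one stitch per iteration
    loss = loss_fn(forward_stitch(x, split), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After this joint training, picking a split point at inference time simply selects one point on the accuracy-speed curve, with no further training.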

What I mean is that, with the same computational cost (50 epochs), the model obtained by assembling with SN-Net does not perform as well as one obtained through knowledge distillation. Another point is that model size is not a problem: for instance, if I want a model of a specific size between Ti and S, I can simply insert additional blocks of the same type in between. I believe the primary motivation in this direction is actually the interpretability of CV backbones, which involves clustering different blocks from different models. For example, some blocks may lean towards capturing fine-grained semantic details, while others may focus on abstract semantics. The paper "Deep Model Reassembly", presented at NeurIPS 2022, explores similar clustering approaches.

The major significance of this direction is to understand what each block is actually doing, which can help us leverage their respective strengths and weaknesses to improve model performance. As for obtaining models of different sizes, I still think it may not hold significant importance and cannot serve as the primary motivation in this context.

Good to see you have your own inspirations.