tianyic / only_train_once

OTOv1-v3, NeurIPS, ICLR, TMLR, DNN Training, Compression, Structured Pruning, Erasing Operators, CNN, Diffusion, LLM

How did you design your supernet search space in OTOv3?

5thGenDev opened this issue · comments

Sorry for asking a vague question that encompasses OTOv3, but I'm in a rush with my thesis and just bumped into your paper. I just need some pointers. Also, some extra questions:

  • What encoding scheme did you use to embed different operations within a network architecture? Can it be replaced with another encoding scheme?

@5thGenDev

This is a good question. Here are some high-level points.

  • How to automatically formulate the search space of a general supernet?

    In general, OTOv3 analyzes the dependencies among the vertices (operators) inside the supernet and builds a dependency graph. The dependency graph is then used to figure out which groups of vertices are removable, so that after removing them, the remaining DNN is still valid, i.e., functions normally (see the toy sketch after these points). In particular, OTOv3 considers a class of such removable structures, the generalized zero-invariant groups (GeZIGs). The set of GeZIGs forms the search space of the given supernet.

  • Can it be replaced with another encoding scheme?

    Yes, in principle. As long as the encoding scheme ensures that the DNN remains valid after removing the corresponding vertices, that encoding scheme would work.
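To make the zero-invariance idea concrete, here is a minimal PyTorch sketch (the layer sizes and channel index are made up for illustration; this is not the OTOv3 code path). Once every parameter in a group spanning Conv -> BN -> Conv is zero, the corresponding channel can be physically erased and the smaller network produces the same output:

```python
# Toy illustration of a zero-invariant group: zero one channel group across
# Conv -> BN -> Conv, then erase that channel and verify the output is unchanged.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),   # conv1
    nn.BatchNorm2d(8),               # bn
    nn.ReLU(),
    nn.Conv2d(8, 4, 3, padding=1),   # conv2
).eval()

k = 5  # the channel group we pretend the solver drove to zero
with torch.no_grad():
    net[0].weight[k].zero_(); net[0].bias[k].zero_()   # conv1 filter k and its bias
    net[1].weight[k].zero_(); net[1].bias[k].zero_()   # bn gamma/beta of channel k
    net[3].weight[:, k].zero_()                        # conv2 slice reading channel k

x = torch.randn(1, 3, 16, 16)
y_zeroed = net(x)

# Physically erase channel k -> a smaller but functionally identical subnet.
keep = [i for i in range(8) if i != k]
slim = nn.Sequential(
    nn.Conv2d(3, 7, 3, padding=1),
    nn.BatchNorm2d(7),
    nn.ReLU(),
    nn.Conv2d(7, 4, 3, padding=1),
).eval()
with torch.no_grad():
    slim[0].weight.copy_(net[0].weight[keep]); slim[0].bias.copy_(net[0].bias[keep])
    slim[1].weight.copy_(net[1].weight[keep]); slim[1].bias.copy_(net[1].bias[keep])
    slim[1].running_mean.copy_(net[1].running_mean[keep])
    slim[1].running_var.copy_(net[1].running_var[keep])
    slim[3].weight.copy_(net[3].weight[:, keep]); slim[3].bias.copy_(net[3].bias)

print(torch.allclose(y_zeroed, slim(x), atol=1e-6))  # True: removal preserves the output
```

GeZIGs generalize this pattern to the removable structures discovered from the dependency graph of the traced supernet.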

@tianyic
Thank you for your quick answer, that means a lot.

The dependency graph is then used to figure out which groups of vertices are removable, so that after removing them, the remaining DNN is still valid, i.e., functions normally. In particular, OTOv3 considers a class of such removable structures, the generalized zero-invariant groups (GeZIGs). The set of GeZIGs forms the search space of the given supernet.
So this is how the search space/supernet gradually reduces redundancy, like Progressive-DARTS, except that OTOv3's supernet is non-differentiable, which is what makes the Hierarchical Half-Space Projected Gradient special (I think?)

I have 3 more questions:

  • Which of the following operators: Self-Attention, Cross-Attention, MLP, Depthwise Conv, are not available in the OTOv3 search space due to being removable like a skip connection?
  • Is there a "performance predictor" mechanism to predict the performance of constructed subnets?
  • How do you tackle the rank disorder problem present in every one-shot supernet? "one-shot methods make a key assumption: the ranking of architectures evaluated with the supernet is relatively consistent with the ranking one would obtain from training them independently; when this assumption is not met, it is known as rank disorder" - Neural Architecture Search: Insights from 1000 Papers

@5thGenDev

These are insightful questions. Please see my responses below.

So this is how the search space/supernet gradually reduces redundancy, like Progressive-DARTS, except that OTOv3's supernet is non-differentiable, which is what makes the Hierarchical Half-Space Projected Gradient special (I think?)

Exactly, you are right. For the sake of generality and autonomy, OTOv3 establishes the search space automatically but currently does not introduce architecture variables to make the supernet differentiable. Therefore, we formulate a single-level hierarchical structured sparsity optimization problem (sketched below) and propose the H2SPG solver to identify redundant removable structures and form subnets. We leave the automatic introduction of auxiliary architecture variables as future work, which would enable multi-level optimization if needed.
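For intuition, and paraphrasing rather than quoting the paper, the single-level problem is roughly of the form below, where f is the training loss, \mathcal{G} is the set of GeZIGs, [x]_g the variables of group g, and K a target number of redundant groups; the graph-validity constraint is what makes the problem hierarchical:

```latex
\min_{x \in \mathbb{R}^n} \; f(x)
\quad \text{s.t.} \quad
\operatorname{Card}\{\, g \in \mathcal{G} : [x]_g = 0 \,\} = K,
\;\; \text{and the subnet induced by erasing all zero groups remains valid.}
```

The role of H2SPG, as described above, is to reach such group-sparse solutions while respecting the graph, so that erasing the zero groups keeps the remaining network functional.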

Which of the following operators: Self-Attention, Cross-Attention, MLP, Depthwise Conv, are not available in the OTOv3 search space due to being removable like a skip connection?

MLP and Depthwise Conv are well supported in OTOv3; e.g., the example RegNet contains a lot of depthwise convolutions. We have not tested Self-Attention and Cross-Attention in OTOv3 yet, but both are supported in the internal OTOv2-LLM (see our working items). In general, the key is to treat such an operator as an entirety, since it is composed of a set of basic vertices in the trace graph (see the toy sketch below). Integrating such composed operators into OTOv3 is doable but is left as future work.
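As a toy illustration of the "treat it as an entirety" point (this is not how OTOv2-LLM implements it, and the sizes and head index below are made up): for multi-head self-attention, one natural group per head spans its Q/K/V rows and the matching output-projection columns. Once the V rows and output columns of a head are zero, the whole head is functionally inert and could be erased as a single removable structure:

```python
# Toy illustration: a per-head group in multi-head self-attention.
import torch
import torch.nn as nn

torch.manual_seed(0)
E, H = 16, 4                      # embedding dim and head count (toy sizes)
hd = E // H                       # per-head dimension
mha = nn.MultiheadAttention(E, H, batch_first=True).eval()
x = torch.randn(2, 10, E)

h = 1                             # head we pretend was identified as redundant
lo, hi = h * hd, (h + 1) * hd     # row range of head h inside each Q/K/V block
with torch.no_grad():
    mha.in_proj_weight[2 * E + lo : 2 * E + hi].zero_()   # V rows of head h
    mha.in_proj_bias[2 * E + lo : 2 * E + hi].zero_()     # V bias of head h
    mha.out_proj.weight[:, lo:hi].zero_()                 # output-proj columns of head h

y1, _ = mha(x, x, x)

# With the group above zeroed, even the remaining head-h parameters (its Q/K rows)
# no longer influence the output, so the whole head can be erased as one unit.
with torch.no_grad():
    mha.in_proj_weight[lo:hi].normal_()                   # Q rows of head h
    mha.in_proj_weight[E + lo : E + hi].normal_()         # K rows of head h

y2, _ = mha(x, x, x)
print(torch.allclose(y1, y2))  # True: head h contributes nothing to the output
```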

Is there a "performance predictor" mechanism to predict the performance of constructed subnets?

H2SPG itself has a redundancy-identification mechanism from the viewpoint of sparse optimization, which largely plays the role of a performance predictor. Meanwhile, the framework should be flexible enough to integrate with existing performance predictors after proper engineering work.

How do you tackle the rank disorder problem present in every one-shot supernet?

This problem may be related to sensitivity, which as far as I know is a common issue for the NAS community. In OTOv3, we did not pay special attention to it, since that might be too heavy for a 9-page paper :). We did observe that the resulting architectures can fluctuate a bit under varying random seeds.

In the end, OTOv3 focuses on the autonomy of NAS given general super-networks and proposes a fresh end-to-end algorithmic framework. We, together with researchers from the open-source community, will build upon this topic to further improve and explore it, lowering the bar and making AI friendlier to end developers and users.