[NeurIPS 2021 Spotlight] HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning [Paper]
This is the official PyTorch implementation of HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning.
@inproceedings{lee2021help,
title = {HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning},
author = {Lee, Hayeon and Lee, Sewoong and Chong, Song and Hwang, Sung Ju},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2021}
}
- Python 3.8 (Anaconda)
- PyTorch 1.8.1
- CUDA 10.2
Hardware spec used for meta-training the proposed HELP model
- GPU: A single Nvidia GeForce RTX 2080Ti
- CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
$ conda create --name help python=3.8
$ conda activate help
$ conda install pytorch==1.8.1 torchvision cudatoolkit=10.2 -c pytorch
$ pip install nas-bench-201
$ pip install tqdm
$ conda install scipy
$ conda install pyyaml
$ conda install tensorboard
1. Experiments on NAS-Bench-201 Search Space
2. Experiments on FBNet Search Space
3. Experiments on OFA Search Space
4. Experiments on HAT Search Space
We provide the code to reproduce the main results on the NAS-Bench-201 search space as follows:
- Computing architecture ranking correlation between latencies estimated by HELP and true measured latencies on unseen devices (Table 3).
- Latency-constrained NAS Results with MetaD2A + HELP on unseen devices (Table 4).
- Meta-Training HELP model.
We include all required datasets and checkpoints in this GitHub repository.
You can compute architecture ranking correlation between latencies estimated by HELP and true measured latencies on unseen devices on NAS-Bench-201 search space (Table 3):
$ python main.py --search_space nasbench201 \
--mode 'meta-test' \
--num_samples 10 \
--num_meta_train_sample 900 \
--load_path [Path of Checkpoint File] \
--meta_train_devices '1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
--meta_valid_devices 'titanx_1,titanx_32,titanx_256,gold_6240' \
--meta_test_devices 'titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss'
You can use the checkpoint file provided in this repository, ./data/nasbench201/checkpoint/help_max_corr.pt, as follows:
$ python main.py --search_space nasbench201 \
--mode 'meta-test' \
--num_samples 10 \
--num_meta_train_sample 900 \
--load_path './data/nasbench201/checkpoint/help_max_corr.pt' \
--meta_train_devices '1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
--meta_valid_devices 'titanx_1,titanx_32,titanx_256,gold_6240' \
--meta_test_devices 'titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss'
Alternatively, you can use the provided script:
$ bash script/run_meta_test_nasbench201.sh [GPU_NUM]
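The three device lists in the commands above are comma-separated strings, and the meta-test devices are described as unseen. The following is a minimal Python sketch, not taken from this repository, that parses the lists and checks that the three splits do not overlap. The helper names `parse_devices` and `find_overlaps` are hypothetical.

```python
from itertools import combinations

# Device lists copied from the meta-test command above.
META_TRAIN = (
    "1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,"
    "samsung_a50,pixel3,essential_ph_1,samsung_s7"
)
META_VALID = "titanx_1,titanx_32,titanx_256,gold_6240"
META_TEST = "titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss"


def parse_devices(arg):
    """Split a comma-separated device string into a list of names (hypothetical helper)."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def find_overlaps(splits):
    """Return device names shared by any two splits, keyed by the pair of split names."""
    overlaps = {}
    for (name_a, devs_a), (name_b, devs_b) in combinations(splits.items(), 2):
        shared = set(devs_a) & set(devs_b)
        if shared:
            overlaps[(name_a, name_b)] = sorted(shared)
    return overlaps


splits = {
    "train": parse_devices(META_TRAIN),
    "valid": parse_devices(META_VALID),
    "test": parse_devices(META_TEST),
}
overlaps = find_overlaps(splits)
print({name: len(devs) for name, devs in splits.items()})
print("overlapping devices:", overlaps)
```

An empty `overlaps` dictionary means no device name appears in more than one split.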
Architecture Ranking Correlation Results (Table 3)
Method | # of Training Samples From Target Device | Desktop GPU (Titan RTX Batch 256) | Desktop CPU (Intel Gold 6226) | Mobile Pixel2 | Raspi4 | ASIC | FPGA | Mean |
---|---|---|---|---|---|---|---|---|
FLOPS | - | 0.950 | 0.826 | 0.765 | 0.846 | 0.437 | 0.900 | 0.787 |
Layer-wise Predictor | - | 0.667 | 0.866 | - | - | - | - | 0.767 |
BRP-NAS | 900 | 0.814 | 0.796 | 0.666 | 0.847 | 0.811 | 0.801 | 0.789 |
BRP-NAS (+extra samples) | 3200 | 0.822 | 0.805 | 0.693 | 0.853 | 0.830 | 0.828 | 0.805 |
HELP (Ours) | 10 | 0.987 | 0.989 | 0.802 | 0.890 | 0.940 | 0.985 | 0.932 |
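The table reports a ranking correlation between predicted and measured latencies. This README does not name the coefficient, so the sketch below assumes Spearman's rank correlation. It is a pure-Python illustration on made-up latencies, not the repository's evaluation code, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(predicted, measured):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(predicted), average_ranks(measured)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up latencies (ms) for five architectures; not data from the paper.
predicted = [10.2, 12.5, 9.8, 15.1, 11.0]
measured = [10.9, 12.0, 9.5, 16.3, 10.1]
rho = spearman(predicted, measured)
print(f"Spearman rank correlation: {rho:.3f}")
```

Only the ordering of the architectures matters here: a predictor that is off by a constant factor on every architecture would still score 1.0.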
You can reproduce latency-constrained NAS results with MetaD2A + HELP on unseen devices on NAS-Bench-201 search space (Table 4):
$ python main.py --search_space nasbench201 --mode 'nas' \
--load_path [Path of Checkpoint File] \
--sampled_arch_path 'data/nasbench201/arch_generated_by_metad2a.txt' \
--nas_target_device [Device] \
--latency_constraint [Latency Constraint]
For example, to use the checkpoint file provided in this repository (./data/nasbench201/checkpoint/help_max_corr.pt) with CPU Intel Gold 6226 (gold_6226) with batch size 256 as the target device and a latency constraint of 11.0 ms, run:
$ python main.py --search_space nasbench201 --mode 'nas' \
--load_path './data/nasbench201/checkpoint/help_max_corr.pt' \
--sampled_arch_path 'data/nasbench201/arch_generated_by_metad2a.txt' \
--nas_target_device gold_6226 \
--latency_constraint 11.0
Alternatively, you can use the provided script:
$ bash script/run_nas_metad2a.sh [GPU_NUM]
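Conceptually, latency-constrained NAS picks the most accurate candidate whose predicted latency stays within the constraint. The sketch below illustrates that selection rule in Python on made-up candidates. It is a generic illustration and not the search procedure implemented in main.py; the `Candidate` type and `select_under_constraint` function are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    arch: str                     # architecture identifier (hypothetical format)
    predicted_latency_ms: float   # latency estimated by a latency predictor
    predicted_accuracy: float     # accuracy estimated by an accuracy predictor


def select_under_constraint(
    candidates: Sequence[Candidate], latency_constraint_ms: float
) -> Optional[Candidate]:
    """Return the most accurate candidate within the constraint, or None if none fits."""
    feasible = [c for c in candidates if c.predicted_latency_ms <= latency_constraint_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.predicted_accuracy)


# Made-up candidates; not results from the paper.
candidates = [
    Candidate("arch_a", 9.1, 68.4),
    Candidate("arch_b", 10.6, 70.0),
    Candidate("arch_c", 12.9, 72.3),
]
best = select_under_constraint(candidates, latency_constraint_ms=11.0)
print(best)
```

Because the constraint is applied to predicted latency, the reported latency of the selected architecture can end up slightly above the constraint, as in some rows of Table 4 (for example, 14.3 ms against a 14.0 ms constraint).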
Efficient Latency-constrained NAS Results (Table 4)
Device | # of Training Samples from Target Device | Latency Constraint (ms) | Latency (ms) | Accuracy (%) | Neural Architecture Config |
---|---|---|---|---|---|
GPU Titan RTX (Batch 256), titan_rtx_256 | 10 | 18.0 / 21.0 / 25.0 | 17.8 / 18.9 / 24.2 | 69.7 / 71.5 / 71.8 | link / link / link |
CPU Intel Gold 6226, gold_6226 | 10 | 8.0 / 11.0 / 14.0 | 8.0 / 10.7 / 14.3 | 67.3 / 70.2 / 72.1 | link / link / link |
Mobile Pixel2, pixel2 | 10 | 14.0 / 18.0 / 22.0 | 13.0 / 19.0 / 25.0 | 69.7 / 71.8 / 73.2 | link / link / link |
ASIC-Eyeriss, eyeriss | 10 | 5.0 / 7.0 / 9.0 | 3.9 / 5.1 / 9.1 | 71.5 / 71.8 / 73.5 | link / link / link |
FPGA, fpga | 10 | 4.0 / 5.0 / 6.0 | 3.8 / 4.7 / 7.4 | 70.2 / 71.8 / 73.5 | link / link / link |
Note that meta-training the HELP model is performed only once for all NAS results:
$ python main.py --search_space nasbench201 \
--mode 'meta-train' \
--num_samples 10 \
--num_meta_train_sample 900 \
--meta_train_devices '1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
--meta_valid_devices 'titanx_1,titanx_32,titanx_256,gold_6240' \
--meta_test_devices 'titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss' \
--exp_name [EXP_NAME] \
--seed 3 # e.g.) 1, 2, 3
Alternatively, you can use the provided script:
$ bash script/run_meta_training_nasbench201.sh [GPU_NUM]
The results (checkpoint file, log file, etc.) are saved in ./results/nasbench201/[EXP_NAME].
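The `--seed` comment in the meta-training command above suggests runs with seeds 1, 2, and 3. The sketch below builds one command line per seed from the flags shown in that command. It is an assumption about how one might script repeated runs, not something provided by this repository; the experiment-name prefix `help_nb201` is hypothetical.

```python
import shlex

# Flags copied from the meta-training command above.
BASE_ARGS = [
    "python", "main.py",
    "--search_space", "nasbench201",
    "--mode", "meta-train",
    "--num_samples", "10",
    "--num_meta_train_sample", "900",
    "--meta_train_devices",
    "1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,"
    "samsung_a50,pixel3,essential_ph_1,samsung_s7",
    "--meta_valid_devices", "titanx_1,titanx_32,titanx_256,gold_6240",
    "--meta_test_devices", "titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss",
]


def build_command(seed, exp_prefix="help_nb201"):
    """Return the argument list for one run; exp_prefix is a hypothetical name."""
    return BASE_ARGS + ["--exp_name", f"{exp_prefix}_seed{seed}", "--seed", str(seed)]


commands = [build_command(seed) for seed in (1, 2, 3)]
for cmd in commands:
    # Print a shell-safe command line; pass cmd to subprocess.run to launch it.
    print(shlex.join(cmd))
```

Giving each seed its own experiment name keeps the runs from writing to the same results directory.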
We provide the code to reproduce the main results on the FBNet search space as follows:
- Computing architecture ranking correlation between latencies estimated by HELP and true measured latencies on unseen devices (Table 2).
- Meta-Training HELP model.
We include all required datasets and checkpoints in this GitHub repository.
You can compute architecture ranking correlation between latencies estimated by HELP and true measured latencies on unseen devices on FBNet search space (Table 2):
$ python main.py --search_space fbnet \
--mode 'meta-test' \
--num_samples 10 \
--num_episodes 4000 \
--num_meta_train_sample 4000 \
--load_path './data/fbnet/checkpoint/help_max_corr.pt' \
--meta_train_devices '1080ti_1,1080ti_32,1080ti_64,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
--meta_valid_devices 'titanx_1,titanx_32,titanx_64,gold_6240' \
--meta_test_devices 'fpga,raspi4,eyeriss'
Alternatively, you can use the provided script:
$ bash script/run_meta_test_fbnet.sh [GPU_NUM]
Architecture Ranking Correlation Results (Table 2)
Method | Raspi4 | ASIC | FPGA | Mean |
---|---|---|---|---|
MAML | 0.718 | 0.763 | 0.727 | 0.736 |
Meta-SGD | 0.821 | 0.822 | 0.776 | 0.806 |
HELP (Ours) | 0.887 | 0.943 | 0.892 | 0.910 |
Note that meta-training the HELP model is performed only once for all results:
$ python main.py --search_space fbnet \
--mode 'meta-train' \
--num_samples 10 \
--num_episodes 4000 \
--num_meta_train_sample 4000 \
--exp_name [EXP_NAME] \
--meta_train_devices '1080ti_1,1080ti_32,1080ti_64,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
--meta_valid_devices 'titanx_1,titanx_32,titanx_64,gold_6240' \
--meta_test_devices 'fpga,raspi4,eyeriss' \
--seed 3 # e.g.) 1, 2, 3
Alternatively, you can use the provided script:
$ bash script/run_meta_training_fbnet.sh [GPU_NUM]
The results (checkpoint file, log file, etc.) are saved in ./results/fbnet/[EXP_NAME].
We provide the code to reproduce the main results on the OFA search space as follows:
- Latency-constrained NAS Results with accuracy predictor of OFA + HELP on unseen devices (Table 5).
- Validating the obtained neural architecture on ImageNet-1K.
- Meta-Training HELP model.
We include the required datasets, except ImageNet-1K, and the checkpoints in this GitHub repository. To validate the obtained neural architecture on ImageNet-1K, you need to download ImageNet-1K (2012 ver.).
You can reproduce latency-constrained NAS results with OFA + HELP on unseen devices on OFA search space (Table 5):
$ python main.py \
--search_space ofa \
--mode nas \
--num_samples 10 \
--seed 3 \
--num_meta_train_sample 4000 \
--load_path './data/ofa/checkpoint/help_max_corr.pt' \
--nas_target_device [DEVICE_NAME] \
--latency_constraint [LATENCY_CONSTRAINT] \
--exp_name 'nas' \
--meta_train_devices '2080ti_1,2080ti_32,2080ti_64,titan_xp_1,titan_xp_32,titan_xp_64,v100_1,v100_32,v100_64' \
--meta_valid_devices 'titan_rtx_1,titan_rtx_32' \
--meta_test_devices 'titan_rtx_64'
For example,
$ python main.py \
--search_space ofa \
--mode nas \
--num_samples 10 \
--seed 3 \
--num_meta_train_sample 4000 \
--load_path './data/ofa/checkpoint/help_max_corr.pt' \
--nas_target_device titan_rtx_64 \
--latency_constraint 20 \
--exp_name 'nas' \
--meta_train_devices '2080ti_1,2080ti_32,2080ti_64,titan_xp_1,titan_xp_32,titan_xp_64,v100_1,v100_32,v100_64' \
--meta_valid_devices 'titan_rtx_1,titan_rtx_32' \
--meta_test_devices 'titan_rtx_64'
Alternatively, you can use the provided script:
$ bash script/run_nas_ofa.sh [GPU_NUM]
Efficient Latency-constrained NAS Results (Table 5)
Device | Samples from Target Device | Latency Constraint (ms) | Latency (ms) | Accuracy (%) | Architecture Config |
---|---|---|---|---|---|
GPU Titan RTX (Batch 64) | 10 | 20 / 23 / 28 | 20.3 / 23.1 / 28.6 | 76.0 / 76.8 / 77.9 | link / link / link |
CPU Intel Gold 6226 | 20 | 170 / 190 | 147 / 171 | 77.6 / 78.1 | link / link |
Jetson AGX Xavier | 10 | 65 / 70 | 67.4 / 76.4 | 75.9 / 76.4 | link / link |
$ python validate_imagenet.py \
--config_path [Path of neural architecture config file] \
--imagenet_save_path [Path of ImageNet 1k]
For example:
$ python validate_imagenet.py \
--config_path 'data/ofa/architecture_config/gpu_titan_rtx_64/latency_28.6ms_accuracy_77.9.json' \
--imagenet_save_path './ILSVRC2012'
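The config file name in the example above appears to encode the latency and accuracy of the architecture. The sketch below parses such a name in Python. The naming pattern is inferred from this single example path and is an assumption; other config files may be named differently, and `parse_config_name` is a hypothetical helper.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from 'latency_28.6ms_accuracy_77.9.json' (an assumption).
CONFIG_NAME = re.compile(
    r"latency_(?P<latency>\d+(?:\.\d+)?)ms_accuracy_(?P<accuracy>\d+(?:\.\d+)?)\.json$"
)


def parse_config_name(path):
    """Return (latency_ms, accuracy_percent) from a config path, or None if no match."""
    match = CONFIG_NAME.search(PurePosixPath(path).name)
    if match is None:
        return None
    return float(match["latency"]), float(match["accuracy"])


example = "data/ofa/architecture_config/gpu_titan_rtx_64/latency_28.6ms_accuracy_77.9.json"
print(parse_config_name(example))
```

Returning `None` for names that do not match keeps the helper safe to call on files that follow a different convention.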
Note that meta-training the HELP model is performed only once for all results:
$ python main.py --search_space ofa \
--mode 'meta-train' \
--num_samples 10 \
--num_meta_train_sample 4000 \
--exp_name [EXP_NAME] \
--meta_train_devices '2080ti_1,2080ti_32,2080ti_64,titan_xp_1,titan_xp_32,titan_xp_64,v100_1,v100_32,v100_64' \
--meta_valid_devices 'titan_rtx_1,titan_rtx_32' \
--meta_test_devices 'titan_rtx_64' \
--seed 3 # e.g.) 1, 2, 3
Alternatively, you can use the provided script:
$ bash script/run_meta_training_ofa.sh [GPU_NUM]
We provide the neural architecture configurations to reproduce the machine translation results (WMT'14 En-De task) on the HAT search space.
Efficient Latency-constrained NAS Results
Task | Device | Samples from Target Device | Latency | BLEU score | Architecture Config |
---|---|---|---|---|---|
WMT'14 En-De | GPU NVIDIA Titan RTX | 10 | 74.0ms / 106.5ms | 27.19 / 27.44 | link / link |
WMT'14 En-De | CPU Intel Xeon Gold 6240 | 10 | 159.6ms / 343.2ms | 27.20 / 27.52 | link / link |
You can test the models by BLEU score and by computing latency.
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (ICML17)
Meta-SGD: Learning to Learn Quickly for Few-Shot Learning
Once-for-All: Train One Network and Specialize it for Efficient Deployment (ICLR20)
NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search (ICLR20)
BRP-NAS: Prediction-based NAS using GCNs (NeurIPS20)
HAT: Hardware Aware Transformers for Efficient Natural Language Processing (ACL20)
Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets (ICLR21)
HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark (ICLR21)