There are 2 repositories under the efficient-inference topic.
Efficient AI Backbones including GhostNet, TNT and MLP, developed by Huawei Noah's Ark Lab.
[ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling
EfficientFormerV2 [ICCV 2023] & EfficientFormer [NeurIPS 2022]
Code for paper "AdderNet: Do We Really Need Multiplications in Deep Learning?"
List of papers related to neural network quantization in recent AI conferences and journals.
[NeurIPS 2024 Spotlight] "LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS", Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
Learning Efficient Convolutional Networks through Network Slimming, in ICCV 2017.
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
📚 Collection of awesome generation acceleration resources.
Explorations into some recent techniques surrounding speculative decoding
[ECCV2022] Efficient Long-Range Attention Network for Image Super-resolution
(CVPR 2021, Oral) Dynamic Slimmable Network
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
[ECCV 2022] Official implementation of the paper "DeciWatch: A Simple Baseline for 10x Efficient 2D and 3D Pose Estimation"
Official code repository for Sketch-of-Thought (SoT)
[NeurIPS 2024] Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
[NeurIPS'23] Speculative Decoding with Big Little Decoder
[ICLR 2022] Code for Graph-less Neural Networks: Teaching Old MLPs New Tricks via Distillation (GLNN)
[NeurIPS'24] Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long)
Implementation of AAAI 21 paper: Nested Named Entity Recognition with Partially Observed TreeCRFs
Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025)
[CVPR2025] Code Release for "FlexGS: Train Once, Deploy Everywhere with Many-in-One Flexible 3D Gaussian Splatting"
Jia-Hong Lee, Yi-Ming Chan, Ting-Yen Chen, and Chu-Song Chen, "Joint Estimation of Age and Gender from Unconstrained Face Images using Lightweight Multi-task CNN for Mobile Applications," IEEE International Conference on Multimedia Information Processing and Retrieval, MIPR 2018
[ECCV 2020] Code release for "Resolution Switchable Networks for Runtime Efficient Image Recognition"
Code for WF-IoT paper 'TinyML Benchmark: Executing Fully Connected Neural Networks on Commodity Microcontrollers'
🚀 Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models