A curated list of awesome projects and papers on real-time AI and DNN inference.
- Edge Intelligence: Architectures, Challenges, and Applications by Xu, Dianlei, et al., arxiv 2020
- A Survey of Multi-Tenant Deep Learning Inference on GPU by Yu, Fuxun, et al., arxiv 2022
- Machine Learning in Real-Time Internet of Things (IoT) Systems: A Survey by Bian, Jiang, et al., IOTJ 2022
- AI Augmented Edge and Fog Computing: Trends and Challenges by Tuli, Shreshth, et al., arxiv 2022
- Enable deep learning on mobile devices: Methods, systems, and applications by Cai, Han, et al., TODAES 2022
- Multi-DNN Accelerators for Next-Generation AI Systems by Venieris, Stylianos I., Christos-Savvas Bouganis, and Nicholas D. Lane., arxiv 2022
- A Survey of GPU Multitasking Methods Supported by Hardware Architecture by Zhao, Chen, et al., IEEE TPDS 2021
- The Future of Consumer Edge-AI Computing by Laskaridis, Stefanos, et al., arxiv 2022
- DLAS: An Exploration and Assessment of the Deep Learning Acceleration Stack by Gibson, Perry, et al., arxiv 2023
- Moses: Exploiting Cross-device Transferable Features for On-device Tensor Program Optimization by Zhao, Zhihe, et al., HotMobile 2023
- TASO: The Tensor Algebra SuperOptimizer for Deep Learning by Zhihao Jia et al., SOSP 2019
- AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures by Zhen Zheng et al., ASPLOS 2022
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections by Haojie Wang et al., OSDI 2021
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks by Lingxiao Ma et al., OSDI 2020
- Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance by Jiarong Xing et al., MLSys 2022
- Ansor: Generating High-Performance Tensor Programs for Deep Learning by Lianmin Zheng et al., OSDI 2020
- TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers by Lianmin Zheng et al., NeurIPS 2021
- Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs by Liang, Rendong, et al., MobiCom 2022
- AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs by Wang, Manni, et al., MobiCom 2021
- IOS: Inter-Operator Scheduler for CNN Acceleration by Ding, Yaoyao, et al., MLSys 2021
- DeepCuts: A Deep Learning Optimization Framework for Versatile GPU Workloads by Jung, Wookeun, Thanh Tuan Dao, and Jaejin Lee., PLDI 2021
- CASE: a compiler-assisted SchEduling framework for multi-GPU systems by Chen, Chao, Chris Porter, and Santosh Pande., PPoPP 2022
- Chameleon: Adaptive code optimization for expedited deep neural network compilation by Ahn, Byung Hoon, et al., arxiv 2020
- Analytical characterization and design space exploration for optimization of CNNs by Li, Rui, et al., ASPLOS 2021
- DNNFusion: accelerating deep neural networks execution with advanced operator fusion by Niu, Wei, et al., PLDI 2021
- AutoGTCO: Graph and Tensor Co-Optimize for Image Recognition with Transformers on GPU by Bai, Yang, et al., ICCAD 2021
- DietCode: Automatic Optimization for Dynamic Tensor Programs by Zheng, Bojian, et al., MLSys 2022
- ROLLER: Fast and Efficient Tensor Compilation for Deep Learning by Zhu, Hongyu, et al., OSDI 2022
- FamilySeer: Towards Optimized Tensor Codes by Exploiting Computation Subgraph Similarity by Zhang, Shanjun, et al., arxiv 2022
- Reusing Auto-Schedules for Efficient DNN Compilation by Gibson, Perry, and José Cano., arxiv 2022
- Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs by Ding, Yaoyao, et al., arxiv 2022
- Cortex: A Compiler for Recursive Deep Learning Models by Fegade, Pratik, et al., MLSys 2021
- SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction by Lin, Zhiqi, et al., arxiv 2023
- Seastar: Vertex-Centric Programming for Graph Neural Networks by Wu, Yidi, et al., EuroSys 2021
- On Optimizing the Communication of Model Parallelism by Zhuang, Yonghao, et al., MLSys 2023
- ALT: Boosting Deep Learning Performance by Breaking the Wall between Graph and Operator Level Optimizations by Xu, Zhiying, et al., arxiv 2022
- AGO: Boosting Mobile AI Inference Performance by Removing Constraints on Graph Optimization by Xu, Zhiying, Hongding Peng, and Wei Wang., INFOCOM 2023
- Enabling Data Movement and Computation Pipelining in Deep Learning Compiler by Huang, Guyue, et al., MLSys 2023
- Automatic Horizontal Fusion for GPU Kernels by Li, Ao, et al., CGO 2022
- Compiler Framework for Optimizing Dynamic Parallelism on GPUs by Olabi, Mhd Ghaith, et al., CGO 2022
- Transfer-Tuning: Reusing Auto-Schedules for Efficient Tensor Program Code Generation by Gibson, Perry, and José Cano., PACT 2022
- NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers by Liu, Jiawei, et al., ASPLOS 2023
- Codon: A Compiler for High-Performance Pythonic Applications and DSLs by Shajii, Ariya, et al., CC 2023
- CMLCompiler: A Unified Compiler for Classical Machine Learning by Wen, Xu, et al., arxiv 2023
- VeGen: A Vectorizer Generator for SIMD and Beyond by Chen, Yishen, et al., ASPLOS 2021
- Composable and Modular Code Generation in MLIR by Vasilache, Nicolas, et al., arxiv 2022
- TinyIREE: An ML Execution Environment for Embedded Systems from Compilation to Deployment by Liu, Hsin-I. Cindy, et al., arxiv 2022
- High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results by Katel, Navdeep, Vivek Khandelwal, and Uday Bondhugula., arxiv 2021
- Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform by Zhang, Shiwei, et al., arxiv 2023
- Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations by Tillet, Philippe, Hsiang-Tsung Kung, and David Cox., MAPL 2019
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness by Dao, Tri, et al., NeurIPS 2022
- Graphene: An IR for Optimized Tensor Computations on GPUs by Hagedorn, Bastian, et al., ASPLOS 2023
- TensorIR: An Abstraction for Automatic Tensorized Program Optimization by Feng, Siyuan, et al., ASPLOS 2023
- SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning by Ye, Zihao, et al., ASPLOS 2023
- Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators by Bi, Jun, et al., ASPLOS 2023
- FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System by Zheng, Size, et al., ASPLOS 2020
- AutoMap: Automatic Mapping of Neural Networks to Deep Learning Accelerators for Edge Devices by Wang, Yanhong, et al., TCAD 2022
- Optimizing Dynamic Neural Networks with Brainstorm by Cui, Weihao, et al., OSDI 2023
- EINNET: Optimizing Tensor Programs with Derivation-Based Transformations by Zheng, Liyan, et al., OSDI 2023
- Welder: Scheduling Deep Learning Memory Access via Tile-graph by Shi, Yining, et al., OSDI 2023
- TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs by Phothilimthana, Phitchaya Mangpo, et al., arxiv 2023
- Transfer Learning Across Heterogeneous Features For Efficient Tensor Program Generation by Verma, Gaurav, et al., arxiv 2023
- Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators by Zhao, Jie, et al., arxiv 2023
- Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations by Jia, Fucheng, et al., arxiv 2023
- Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning by Zhang, Chen, et al., OSDI 2023
- Revisiting the Evaluation of Deep Learning-Based Compiler Testing by Tian, Yongqiang, et al., IJCAI 2023
- Learning Compiler Pass Orders using Coreset and Normalized Value Prediction by Liang, Youwei, et al., ICML 2023
- Learning to make compiler optimizations more effective by Mammadli, Rahim, et al. PLDI 2021
- C2TACO: Lifting Tensor Code to TACO by Magalhães, José Wesley de Souza, et al., GPCE 2023
- Autotuning convolutions is easier than you think by Tollenaere, Nicolas, et al., ACM TACO 2023
- DnD: A Cross-Architecture Deep Neural Network Decompiler by Wu, Ruoyu, et al., USENIX Security 2022
- Decompiling x86 Deep Neural Network Executables by Liu, Zhibo, et al., USENIX Security 2023
- LibSteal: Model Extraction Attack towards Deep Learning Compilers by Reversing DNN Binary Library by Zhang, Jinquan, Pei Wang, and Dinghao Wu., ENASE 2023
- Reverse engineering convolutional neural networks through side-channel information leaks by Hua, Weizhe, Zhiru Zhang, and G. Edward Suh., DAC 2018
- Cache telepathy: Leveraging shared resource attacks to learn DNN architectures by Yan, Mengjia, Christopher W. Fletcher, and Josep Torrellas., USENIX Security 2020
- Knockoff nets: Stealing functionality of black-box models by Orekondy, Tribhuvanesh, Bernt Schiele, and Mario Fritz., CVPR 2019
- Deepsniffer: A dnn model extraction framework based on learning architectural hints by Hu, Xing, et al., ASPLOS 2020
- I know what you trained last summer: A survey on stealing machine learning models and defences by Oliynyk, Daryna, Rudolf Mayer, and Andreas Rauber., ACM Computing Surveys 2023
- Deepsteal: Advanced model extractions leveraging efficient weight stealing in memories by Rakin, Adnan Siraj, et al., S&P 2022
- Hermes attack: Steal DNN models with lossless inference accuracy by Zhu, Yuankun, et al., USENIX Security 2021
- Stealing machine learning models via prediction APIs by Tramèr, Florian, et al., USENIX Security 2016
- SoK: Demystifying Binary Lifters Through the Lens of Downstream Applications by Liu, Zhibo, et al., S&P 2022
- HuffDuff: Stealing Pruned DNNs from Sparse Accelerators by Yang, Dingqing, Prashant J. Nair, and Mieszko Lis., ASPLOS 2023
- EdgeML: An AutoML framework for real-time deep learning on the edge by Zhao, Zhihe, et al., IoTDI 2021
- SPINN: synergistic progressive inference of neural networks over device and cloud by Laskaridis, Stefanos, et al., MobiCom 2020
- Clio: Enabling automatic compilation of deep learning pipelines across IoT and cloud by Huang, Jin, et al., MobiCom 2020
- Neurosurgeon: Collaborative intelligence between the cloud and mobile edge by Kang, Yiping, et al., ASPLOS 2017
- Mistify: Automating DNN Model Porting for On-Device Inference at the Edge by Guo, Peizhen, et al., NSDI 2021
- Deep compressive offloading: Speeding up neural network inference by trading edge computation for network latency by Yao, Shuochao, et al., SenSys 2020
- Elf: accelerate high-resolution mobile deep vision with content-aware parallel offloading by Zhang, Wuyang, et al., MobiCom 2021
- Edge assisted real-time object detection for mobile augmented reality by Liu, Luyang, Hongyu Li, and Marco Gruteser., MobiCom 2019
- AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments by Wen, Hao, et al., MobiCom 2023
- VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling by Liu, Zihan, et al., ASPLOS 2022
- RT-mDL: Supporting Real-Time Mixed Deep Learning Tasks on Edge Platforms by Ling, Neiwen, et al., SenSys 2021
- Horus: Interference-aware and prediction-based scheduling in deep learning systems by Yeung, Gingfung, et al., IEEE TPDS 2021
- Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU by Yu, Fuxun, et al., ICCAD 2021
- Interference-aware scheduling for inference serving by Mendoza, Daniel, et al., EuroMLSys 2021
- Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences by Han, Mingcong, et al., OSDI 2022
- Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks by Ghodrati, Soroush, et al., MICRO 2020
- Heimdall: mobile GPU coordination platform for augmented reality applications by Yi, Juheon, and Youngki Lee., MobiCom 2020
- DeepEye: Resource efficient local execution of multiple deep vision models using wearable commodity hardware by Mathur, Akhil, et al., MobiSys 2017
- PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications by Bai, Zhihao, et al., OSDI 2020
- Enable simultaneous DNN services based on deterministic operator overlap and precise latency prediction by Cui, Weihao, et al., SC 2021
- LegoDNN: block-grained scaling of deep neural networks for mobile vision by Han, Rui, et al., MobiCom 2021
- NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems by Bateni, Soroush, and Cong Liu., ATC 2020
- Multi-Neural Network Acceleration Architecture by Baek, Eunjin, Dongup Kwon, and Jangwoo Kim., ISCA 2020
- Pipelined data-parallel CPU/GPU scheduling for multi-DNN real-time inference by Xiang, Yecheng, and Hyoseung Kim., RTSS 2019
- NestDNN: Resource-aware multi-tenant on-device deep learning for continuous mobile vision by Fang, Biyi, Xiao Zeng, and Mi Zhang., MobiCom 2018
- FLEP: Enabling flexible and efficient preemption on GPUs by Wu, Bo, et al., ASPLOS 2017
- Prophet: Precise QoS prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers by Chen, Quan, et al., ASPLOS 2017
- PAME: precision-aware multi-exit DNN serving for reducing latencies of batched inferences by Zhang, Shulai, et al., ICS 2022
- Layerweaver: Maximizing resource utilization of neural processing units via layer-wise scheduling by Oh, Young H., et al., HPCA 2021
- LiteReconfig: cost and content aware reconfiguration of video object detection systems for mobile GPUs by Xu, Ran, et al., EuroSys 2022
- ApproxNet: Content and contention-aware video object classification system for embedded clients by Xu, Ran, et al.
- Accelerating deep learning workloads through efficient multi-model execution by Narayanan, Deepak, et al., NeurIPS Workshop 2018
- OLPart: Online Learning based Resource Partitioning for Colocating Multiple Latency-Critical Jobs on Commodity Computers by Chen, Ruobing, et al., EuroSys 2023
- MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural Networks by Kim, Seah, et al., HPCA 2023
- Decentralized Application-Level Adaptive Scheduling for Multi-Instance DNNs on Open Mobile Devices by Sung, Hsin-Hsuan, et al., ATC 2023
- LaLaRAND: Flexible layer-by-layer CPU/GPU scheduling for real-time DNN tasks by Kang, Woosung, et al., RTSS 2021
- DUET: A Compiler-Runtime Subgraph Scheduling Approach for Tensor Programs on a Coupled CPU-GPU Architecture by Zhang, Minjia, Zehua Hu, and Mingqin Li., IPDPS 2021
- Band: coordinated multi-DNN inference on heterogeneous mobile processors by Jeong, Joo Seong, et al., MobiSys 2022
- ODMDEF: On-Device Multi-DNN Execution Framework Utilizing Adaptive Layer-Allocation on General Purpose Cores and Accelerator by Lim, Cheolsun, and Myungsun Kim., IEEE ACCESS 2021
- μLayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization by Kim, Youngsok, et al., EuroSys 2019
- OPTiC: Optimizing collaborative CPU–GPU computing on mobile devices with thermal constraints by Wang, Siqi, Gayathri Ananthanarayanan, and Tulika Mitra., TCAD 2019
- Accelerating Sequence-to-Graph Alignment on Heterogeneous Processors by Feng, Zonghao, and Qiong Luo., ICPP 2021
- Efficient Execution of Deep Neural Networks on Mobile Devices with NPU by Tan, Tianxiang, and Guohong Cao., IPSN 2021
- CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices by Jia, Fucheng, et al., MobiSys 2022
- CODA: Improving resource utilization by slimming and co-locating DNN and CPU jobs by Zhao, Han, et al., ICDCS 2020
- GPUReplay: a 50-KB GPU stack for client ML by Park, Heejin, and Felix Xiaozhu Lin., ASPLOS 2022
- Real-time high performance computing using a Jetson Xavier AGX by Cetre, Cyril, et al., ERTS 2022
- GPU scheduling on the NVIDIA TX2: Hidden details revealed by Amert, Tanya, et al., RTSS 2017
- Nimble: Lightweight and parallel GPU task scheduling for deep learning by Kwon, Woosuk, et al., NeurIPS 2020
- Addressing GPU on-chip shared memory bank conflicts using elastic pipeline by Gou, Chunyang, and Georgi N. Gaydadjiev., IJPP 2013
- A study of persistent threads style GPU programming for GPGPU workloads by Gupta, Kshitij, Jeff A. Stuart, and John D. Owens., IEEE 2012
- Demystifying the placement policies of the NVIDIA GPU thread block scheduler for concurrent kernels by Gilman, Guin, et al., ACM SIGMETRICS Performance Evaluation Review 2021
- Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks by Zhao, Han, et al., ICCD 2021
- Online Thread Auto-Tuning for Performance Improvement and Resource Saving by Luan, Guangqiang, et al., IEEE TPDS 2021
- HSM: A hybrid slowdown model for multitasking GPUs by Zhao, Xia, Magnus Jahre, and Lieven Eeckhout., ASPLOS 2020
- Enabling and exploiting flexible task assignment on GPU through SM-centric program transformations by Wu, Bo, et al., ACM ICS 2015
- Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming by Xu, Qiumin, et al., ISCA 2016
- Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling by Zhong, Jianlong, and Bingsheng He., IEEE TPDS 2013
- Improving GPGPU concurrency with elastic kernels by Pai, Sreepathi, Matthew J. Thazhuthaveetil, and Ramaswamy Govindarajan., ACM SIGARCH Computer Architecture News 2013
- Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs by Kayıran, Onur, et al., PACT 2013
- Orion: A framework for GPU occupancy tuning by Hayes, Ari B., et al., Middleware 2016
- Efficient performance estimation and work-group size pruning for OpenCL kernels on GPUs by Wang, Xiebing, et al., IEEE TPDS 2019
- Online evolutionary batch size orchestration for scheduling deep learning workloads in GPU clusters by Bian, Zhengda, et al., SC 2021
- Autotuning GPU kernels via static and predictive analysis by Lim, Robert, Boyana Norris, and Allen Malony., IEEE ICPP 2017
- GSLICE: Controlled spatial sharing of GPUs for a scalable inference platform by Dhakal, Aditya, Sameer G. Kulkarni, and K. K. Ramakrishnan., SoCC 2020
- Fractional GPUs: Software-based compute and memory bandwidth reservation for GPUs by Jain, Saksham, et al., RTAS 2019
- EffiSha: A software framework for enabling efficient preemptive scheduling of GPU by Chen, Guoyang, et al., PPoPP 2017
- Automatic thread-block size adjustment for memory-bound BLAS kernels on GPUs by Mukunoki, Daichi, Toshiyuki Imamura, and Daisuke Takahashi., MCSOC 2016
- FlexSched: Efficient scheduling techniques for concurrent kernel execution on GPUs by López-Albelda, Bernabé, et al., The Journal of Supercomputing 2022
- Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing by Wang, Zhenning, et al., HPCA 2016
- Optimum: Runtime Optimization for Multiple Mixed Model Deployment Deep Learning Inference by Guo, Kaicheng, et al., preprint 2022
- Exploring AMD GPU scheduling details by experimenting with “worst practices” by Otterness, Nathan, and James H. Anderson., RTNS 2021
- Making Powerful Enemies on NVIDIA GPUs by Yandrofski, Tyler, et al., RTSS 2022
- Contention-Aware GPU Partitioning and Task-to-Partition Allocation for Real-Time Workloads by Zahaf, Houssam-Eddine, et al., RTNS 2021
- Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent by Weng, Qizhen, et al., ATC 2023
- VectorVisor: A Binary Translation Scheme for Throughput-Oriented GPU Acceleration by Ginzburg, Samuel, Mohammad Shahrad, and Michael J. Freedman., ATC 2023
- Arbitor: A Numerically Accurate Hardware Emulation Tool for DNN Accelerators by Jiang, Chenhao, et al., ATC 2023
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge by Abbasi, Saad, Alexander Wong, and Mohammad Javad Shafiee., arxiv 2022
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices by Nair, Saeejith, et al., CVPR 2022
- MAPLE: Microprocessor A Priori for Latency Estimation by Abbasi, Saad, Alexander Wong, and Mohammad Javad Shafiee., CVPR 2022
- nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices by Zhang, Li Lyna, et al., MobiSys 2021
- Predicting and reining in application-level slowdown on spatial multitasking GPUs by Wei, Mengze, et al., JPDC 2020
- A model-based software solution for simultaneous multiple kernels on GPUs by Wu, Hao, et al., TACO 2020
- SMCompactor: A workload-aware fine-grained resource management framework for GPGPUs by Chen, Qichen, et al., SAC 2021
- Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training by Yu, Geoffrey X., et al., ATC 2021
- MCUNet: Tiny deep learning on IoT devices by Lin, Ji, et al., NeurIPS 2020
- TinyML: Current Progress, Research Challenges, and Future Roadmap by Shafique, Muhammad, et al., DAC 2021
- Benchmarking TinyML systems: Challenges and direction by Banbury, Colby R., et al., arxiv 2020
- μNAS: Constrained Neural Architecture Search for Microcontrollers by Liberis, Edgar, Łukasz Dudziak, and Nicholas D. Lane., EuroMLSys 2021
- Memory-efficient Patch-based Inference for Tiny Deep Learning by Lin, Ji, et al., NeurIPS 2021
- Deep Learning on Microcontrollers: A Study on Deployment Costs and Challenges by Svoboda, Filip, Javier Fernandez-Marques, Edgar Liberis, and Nicholas D. Lane., EuroMLSys 2022
- Space-Efficient TREC for Enabling Deep Learning on Microcontrollers by Liu, Jiesong, et al., ASPLOS 2023
- YONO: Modeling multiple heterogeneous neural networks on microcontrollers by Kwon, Young D., Jagmohan Chauhan, and Cecilia Mascolo., IPSN 2022
- Dynamic Multimodal Fusion by Xue, Zihui, and Radu Marculescu., arxiv 2022
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action by Shah, Dhruv, et al., arxiv 2022
- Accelerating mobile audio sensing algorithms through on-chip gpu offloading by Georgiev, Petko, et al., MobiSys 2017
- Enabling Edge Devices that Learn from Each Other: Cross Modal Training for Activity Recognition by Xing, Tianwei, et al., EdgeSys 2018
- SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute by Zheng, Ningxin, et al., OSDI 2022
- ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition by Li, Shiyu, et al., MICRO 2021
- A high-performance sparse tensor algebra compiler in Multi-Level IR by Tian, Ruiqin, et al., arxiv 2021
- Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction by Huang, Guyue, et al., arxiv 2021
- COEXE: An Efficient Co-execution Architecture for Real-Time Neural Network Services by Liu, Chubo, et al., DAC 2020
- TorchSparse: Efficient Point Cloud Inference Engine by Tang, Haotian, et al., MLSys 2022
- SecureTVM: A TVM-Based Compiler Framework for Selective Privacy-Preserving Neural Inference by Huang, Po-Hsuan, et al., TODAES 2023
- PolyMPCNet: Towards ReLU-free Neural Architecture Search in Two-party Computation Based Private Inference by Peng, Hongwu, et al., arxiv 2023
- Cheetah: Lean and Fast Secure Two-Party Deep Neural Network Inference by Huang, Zhicong, et al., USENIX Security 2022
- Exploring Collaborative Distributed Diffusion-Based AI-Generated Content (AIGC) in Wireless Networks by Du, Hongyang, et al., arxiv 2023
- Distributed inference with deep learning models across heterogeneous edge devices by Hu, Chenghao, and Baochun Li., INFOCOM 2022
- ARK: GPU-driven Code Execution for Distributed Deep Learning by Hwang, Changho, et al., NSDI 2023
- On Modular Learning of Distributed Systems for Predicting End-to-End Latency by Liang, Chieh-Jan Mike, et al., NSDI 2023
- Understanding and Optimizing Deep Learning Cold-Start Latency on Edge Devices by Yi, Rongjie, et al., arxiv 2022
- Towards efficient vision transformer inference: a first study of transformers on mobile devices by Wang, Xudong, et al., HotMobile 2022
- EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference by Tambe, Thierry, et al., MICRO 2021
- EDGEWISE: A Better Stream Processing Engine for the Edge by Fu, Xinwei, et al., ATC 2019
- LiteFlow: towards high-performance adaptive neural networks for kernel datapath by Zhang, Junxue, et al., SIGCOMM 2022
- CoCoPIE: Making Mobile AI Sweet As PIE with Compression-Compilation Co-Design by Liu, Shaoshan, et al., arxiv 2020
- Beyond Data and Model Parallelism for Deep Neural Networks by Jia, Zhihao, Matei Zaharia, and Alex Aiken, MLSys 2019
- Discovering faster matrix multiplication algorithms with reinforcement learning by Fawzi, Alhussein, et al., Nature 2022
- Gemel: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge by Padmanabhan, Arthi, et al., NSDI 2023
- RECL: Responsive Resource-Efficient Continuous Learning for Video Analytics by Khani, Mehrdad, et al., NSDI 2023
- Ekya: Continuous learning of video analytics models on edge compute servers by Bhardwaj, Romil, et al., NSDI 2022