A curated list of awesome projects and papers on real-time AI and DNN inference.
- Edge Intelligence: Architectures, Challenges, and Applications by Xu, Dianlei, et al., arxiv 2020
- A Survey of Multi-Tenant Deep Learning Inference on GPU by Yu, Fuxun, et al., arxiv 2022
- Machine Learning in Real-Time Internet of Things (IoT) Systems: A Survey by Bian, Jiang, et al., IOTJ 2022
- AI Augmented Edge and Fog Computing: Trends and Challenges by Tuli, Shreshth, et al., arxiv 2022
- Enable deep learning on mobile devices: Methods, systems, and applications by Cai, Han, et al., TODAES 2022
- Multi-DNN Accelerators for Next-Generation AI Systems by Venieris, Stylianos I., Christos-Savvas Bouganis, and Nicholas D. Lane., arxiv 2022
- A Survey of GPU Multitasking Methods Supported by Hardware Architecture by Zhao, Chen, et al., IEEE TPDS 2021
- The Future of Consumer Edge-AI Computing by Laskaridis, Stefanos, et al., arxiv 2022
- DLAS: An Exploration and Assessment of the Deep Learning Acceleration Stack by Gibson, Perry, et al., arxiv 2023
- Moses: Exploiting Cross-device Transferable Features for On-device Tensor Program Optimization by Zhao, Zhihe, et al., HotMobile 2023
- TASO: The Tensor Algebra SuperOptimizer for Deep Learning by Zhihao Jia et al., SOSP 2019
- AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures by Zhen Zheng et al., ASPLOS 2022
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections by Haojie Wang et al., OSDI 2021
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks by Lingxiao Ma et al., OSDI 2020
- Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance by Jiarong Xing et al., MLSys 2022
- Ansor: Generating High-Performance Tensor Programs for Deep Learning by Lianmin Zheng et al., OSDI 2020
- TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers by Lianmin Zheng et al., NeurIPS 2021
- Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs by Liang, Rendong, et al., MobiCom 2022
- AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs by Wang, Manni, et al., MobiCom 2021
- IOS: Inter-Operator Scheduler for CNN Acceleration by Ding, Yaoyao, et al., MLSys 2021
- DeepCuts: A Deep Learning Optimization Framework for Versatile GPU Workloads by Jung, Wookeun, Thanh Tuan Dao, and Jaejin Lee., PLDI 2021
- CASE: a compiler-assisted SchEduling framework for multi-GPU systems by Chen, Chao, Chris Porter, and Santosh Pande., PPoPP 2022
- Chameleon: Adaptive code optimization for expedited deep neural network compilation by Ahn, Byung Hoon, et al., arxiv 2020
- Analytical characterization and design space exploration for optimization of CNNs by Li, Rui, et al., ASPLOS 2021
- DNNFusion: accelerating deep neural networks execution with advanced operator fusion by Niu, Wei, et al., PLDI 2021
- AutoGTCO: Graph and Tensor Co-Optimize for Image Recognition with Transformers on GPU by Bai, Yang, et al., ICCAD 2021
- DietCode: Automatic Optimization for Dynamic Tensor Programs by Zheng, Bojian, et al., MLSys 2022
- ROLLER: Fast and Efficient Tensor Compilation for Deep Learning by Zhu, Hongyu, et al., OSDI 2022
- FamilySeer: Towards Optimized Tensor Codes by Exploiting Computation Subgraph Similarity by Zhang, Shanjun, et al., arxiv 2022
- Reusing Auto-Schedules for Efficient DNN Compilation by Gibson, Perry, and José Cano., arxiv 2022
- Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs by Ding, Yaoyao, et al., arxiv 2022
- Cortex: A Compiler for Recursive Deep Learning Models by Fegade, Pratik, et al., MLSys 2021
- SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction by Lin, Zhiqi, et al., arxiv 2023
- Seastar: Vertex-Centric Programming for Graph Neural Networks by Wu, Yidi, et al., EuroSys 2021
- On Optimizing the Communication of Model Parallelism by Zhuang, Yonghao, et al., MLSys 2023
- ALT: Boosting Deep Learning Performance by Breaking the Wall between Graph and Operator Level Optimizations by Xu, Zhiying, et al., arxiv 2022
- AGO: Boosting Mobile AI Inference Performance by Removing Constraints on Graph Optimization by Xu, Zhiying, Hongding Peng, and Wei Wang., INFOCOM 2023
- Enabling Data Movement and Computation Pipelining in Deep Learning Compiler by Huang, Guyue, et al., MLSys 2023
- Automatic Horizontal Fusion for GPU Kernels by Li, Ao, et al., CGO 2022
- Compiler Framework for Optimizing Dynamic Parallelism on GPUs by Olabi, Mhd Ghaith, et al., CGO 2022
- Transfer-Tuning: Reusing Auto-Schedules for Efficient Tensor Program Code Generation by Gibson, Perry, and José Cano., PACT 2022
- NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers by Liu, Jiawei, et al., ASPLOS 2023
- Codon: A Compiler for High-Performance Pythonic Applications and DSLs by Shajii, Ariya, et al., CC 2023
- CMLCompiler: A Unified Compiler for Classical Machine Learning by Wen, Xu, et al., arxiv 2023
- VeGen: A Vectorizer Generator for SIMD and Beyond by Chen, Yishen, et al., ASPLOS 2021
- Composable and Modular Code Generation in MLIR by Vasilache, Nicolas, et al., arxiv 2022
- TinyIREE: An ML Execution Environment for Embedded Systems from Compilation to Deployment by Liu, Hsin-I. Cindy, et al., arxiv 2022
- High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results by Katel, Navdeep, Vivek Khandelwal, and Uday Bondhugula., arxiv 2021
- Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform by Zhang, Shiwei, et al., arxiv 2023
- Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations by Tillet, Philippe, Hsiang-Tsung Kung, and David Cox., MAPL 2019
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness by Dao, Tri, et al., NeurIPS 2022
- Graphene: An IR for Optimized Tensor Computations on GPUs by Hagedorn, Bastian, et al., ASPLOS 2023
- TensorIR: An Abstraction for Automatic Tensorized Program Optimization by Feng, Siyuan, et al., ASPLOS 2023
- SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning by Ye, Zihao, et al., ASPLOS 2023
- Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators by Bi, Jun, et al., ASPLOS 2023
- FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System by Zheng, Size, et al., ASPLOS 2020
- AutoMap: Automatic Mapping of Neural Networks to Deep Learning Accelerators for Edge Devices by Wang, Yanhong, et al., TCAD 2022
- Optimizing Dynamic Neural Networks with Brainstorm by Cui, Weihao, et al., OSDI 2023
- EINNET: Optimizing Tensor Programs with Derivation-Based Transformations by Zheng, Liyan, et al., OSDI 2023
- Welder: Scheduling Deep Learning Memory Access via Tile-graph by Shi, Yining, et al., OSDI 2023
- TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs by Phothilimthana, Phitchaya Mangpo, et al., arxiv 2023
- Transfer Learning Across Heterogeneous Features For Efficient Tensor Program Generation by Verma, Gaurav, et al., arxiv 2023
- Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators by Zhao, Jie, et al., arxiv 2023
- Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations by Jia, Fucheng, et al., arxiv 2023
- Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning by Zhang, Chen, et al., OSDI 2023
- Revisiting the Evaluation of Deep Learning-Based Compiler Testing by Tian, Yongqiang, et al., IJCAI 2023
- Learning Compiler Pass Orders using Coreset and Normalized Value Prediction by Liang, Youwei, et al., ICML 2023
- Learning to make compiler optimizations more effective by Mammadli, Rahim, et al. PLDI 2021
- C2TACO: Lifting Tensor Code to TACO by Magalhães, José Wesley de Souza, et al., GPCE 2023
- Autotuning convolutions is easier than you think by Tollenaere, Nicolas, et al., ACM TACO 2023
- DnD: A Cross-Architecture Deep Neural Network Decompiler by Wu, Ruoyu, et al., USENIX Security 2022
- Decompiling x86 Deep Neural Network Executables by Liu, Zhibo, et al., USENIX Security 2023
- LibSteal: Model Extraction Attack towards Deep Learning Compilers by Reversing DNN Binary Library by Zhang, Jinquan, Pei Wang, and Dinghao Wu., ENASE 2023
- Reverse engineering convolutional neural networks through side-channel information leaks by Hua, Weizhe, Zhiru Zhang, and G. Edward Suh., DAC 2018
- Cache telepathy: Leveraging shared resource attacks to learn DNN architectures by Yan, Mengjia, Christopher W. Fletcher, and Josep Torrellas., USENIX Security 2020
- Knockoff nets: Stealing functionality of black-box models by Orekondy, Tribhuvanesh, Bernt Schiele, and Mario Fritz., CVPR 2019
- Deepsniffer: A dnn model extraction framework based on learning architectural hints by Hu, Xing, et al., ASPLOS 2020
- I know what you trained last summer: A survey on stealing machine learning models and defences by Oliynyk, Daryna, Rudolf Mayer, and Andreas Rauber., ACM Computing Surveys 2023
- Deepsteal: Advanced model extractions leveraging efficient weight stealing in memories by Rakin, Adnan Siraj, et al., S&P 2022
- Hermes attack: Steal DNN models with lossless inference accuracy by Zhu, Yuankun, et al., USENIX Security 2021
- Stealing machine learning models via prediction APIs by Tramèr, Florian, et al., USENIX Security 2016
- SoK: Demystifying Binary Lifters Through the Lens of Downstream Applications by Liu, Zhibo, et al., S&P 2022
- HuffDuff: Stealing Pruned DNNs from Sparse Accelerators by Yang, Dingqing, Prashant J. Nair, and Mieszko Lis., ASPLOS 2023
- EdgeML: An AutoML framework for real-time deep learning on the edge by Zhao, Zhihe, et al., IoTDI 2021
- SPINN: synergistic progressive inference of neural networks over device and cloud by Laskaridis, Stefanos, et al., MobiCom 2020
- Clio: Enabling automatic compilation of deep learning pipelines across IoT and cloud by Huang, Jin, et al., MobiCom 2020
- Neurosurgeon: Collaborative intelligence between the cloud and mobile edge by Kang, Yiping, et al., ASPLOS 2017
- Mistify: Automating DNN Model Porting for On-Device Inference at the Edge by Guo, Peizhen, et al., NSDI 2021
- Deep compressive offloading: Speeding up neural network inference by trading edge computation for network latency by Yao, Shuochao, et al., SenSys 2020
- Elf: accelerate high-resolution mobile deep vision with content-aware parallel offloading by Zhang, Wuyang, et al., MobiCom 2021
- Edge assisted real-time object detection for mobile augmented reality by Liu, Luyang, Hongyu Li, and Marco Gruteser., MobiCom 2019
- AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments by Wen, Hao, et al., MobiCom 2023
- VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling by Liu, Zihan, et al., ASPLOS 2022
- RT-mDL: Supporting Real-Time Mixed Deep Learning Tasks on Edge Platforms by Ling, Neiwen, et al., SenSys 2021
- Horus: Interference-aware and prediction-based scheduling in deep learning systems by Yeung, Gingfung, et al., IEEE TPDS 2021
- Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU by Yu, Fuxun, et al., ICCAD 2021
- Interference-aware scheduling for inference serving by Mendoza, Daniel, et al., EuroMLSys 2021
- Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences by Han, Mingcong, et al., OSDI 2022
- Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks by Ghodrati, Soroush, et al., MICRO 2020
- Heimdall: mobile GPU coordination platform for augmented reality applications by Yi, Juheon, and Youngki Lee., MobiCom 2020
- DeepEye: Resource efficient local execution of multiple deep vision models using wearable commodity hardware by Mathur, Akhil, et al., MobiSys 2017
- PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications by Bai, Zhihao, et al., OSDI 2020
- Enable simultaneous DNN services based on deterministic operator overlap and precise latency prediction by Cui, Weihao, et al., SC 2021
- LegoDNN: block-grained scaling of deep neural networks for mobile vision by Han, Rui, et al., MobiCom 2021
- NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems by Bateni, Soroush, and Cong Liu., ATC 2020
- Multi-Neural Network Acceleration Architecture by Baek, Eunjin, Dongup Kwon, and Jangwoo Kim., ISCA 2020
- Pipelined data-parallel CPU/GPU scheduling for multi-DNN real-time inference by Xiang, Yecheng, and Hyoseung Kim., RTSS 2019
- NestDNN: Resource-aware multi-tenant on-device deep learning for continuous mobile vision by Fang, Biyi, Xiao Zeng, and Mi Zhang., MobiCom 2018
- FLEP: Enabling flexible and efficient preemption on GPUs by Wu, Bo, et al., ASPLOS 2017
- Prophet: Precise QoS prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers by Chen, Quan, et al., ASPLOS 2017
- PAME: precision-aware multi-exit DNN serving for reducing latencies of batched inferences by Zhang, Shulai, et al., ICS 2022
- Layerweaver: Maximizing resource utilization of neural processing units via layer-wise scheduling by Oh, Young H., et al., HPCA 2021
- LiteReconfig: cost and content aware reconfiguration of video object detection systems for mobile GPUs by Xu, Ran, et al., EuroSys 2022
- ApproxNet: Content and contention-aware video object classification system for embedded clients by Xu, Ran, et al.
- Accelerating deep learning workloads through efficient multi-model execution by Narayanan, Deepak, et al., NeurIPS Workshop 2018
- OLPart: Online Learning based Resource Partitioning for Colocating Multiple Latency-Critical Jobs on Commodity Computers by Chen, Ruobing, et al., EuroSys 2023
- MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural Networks by Kim, Seah, et al., HPCA 2023
- Decentralized Application-Level Adaptive Scheduling for Multi-Instance DNNs on Open Mobile Devices by Sung, Hsin-Hsuan, et al., ATC 2023
- LaLaRAND: Flexible layer-by-layer CPU/GPU scheduling for real-time DNN tasks by Kang, Woosung, et al., RTSS 2021
- DUET: A Compiler-Runtime Subgraph Scheduling Approach for Tensor Programs on a Coupled CPU-GPU Architecture by Zhang, Minjia, Zehua Hu, and Mingqin Li., IPDPS 2021
- Band: coordinated multi-DNN inference on heterogeneous mobile processors by Jeong, Joo Seong, et al., MobiSys 2022
- ODMDEF: On-Device Multi-DNN Execution Framework Utilizing Adaptive Layer-Allocation on General Purpose Cores and Accelerator by Lim, Cheolsun, and Myungsun Kim., IEEE ACCESS 2021
- μLayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization by Kim, Youngsok, et al., EuroSys 2019
- OPTiC: Optimizing collaborative CPU–GPU computing on mobile devices with thermal constraints by Wang, Siqi, Gayathri Ananthanarayanan, and Tulika Mitra., TCAD 2019
- Accelerating Sequence-to-Graph Alignment on Heterogeneous Processors by Feng, Zonghao, and Qiong Luo., ICPP 2021
- Efficient Execution of Deep Neural Networks on Mobile Devices with NPU by Tan, Tianxiang, and Guohong Cao., IPSN 2021
- CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices by Jia, Fucheng, et al., MobiSys 2022
- CODA: Improving resource utilization by slimming and co-locating DNN and CPU jobs by Zhao, Han, et al., ICDCS 2020
- GPUReplay: a 50-KB GPU stack for client ML by Park, Heejin, and Felix Xiaozhu Lin., ASPLOS 2022
- Real-time high performance computing using a Jetson Xavier AGX by Cetre, Cyril, et al., ERTS 2022
- GPU scheduling on the NVIDIA TX2: Hidden details revealed by Amert, Tanya, et al., RTSS 2017
- Nimble: Lightweight and parallel GPU task scheduling for deep learning by Kwon, Woosuk, et al., NeurIPS 2020
- Addressing GPU on-chip shared memory bank conflicts using elastic pipeline by Gou, Chunyang, and Georgi N. Gaydadjiev., IJPP 2013
- A study of persistent threads style GPU programming for GPGPU workloads by Gupta, Kshitij, Jeff A. Stuart, and John D. Owens., IEEE 2012
- Demystifying the placement policies of the NVIDIA GPU thread block scheduler for concurrent kernels by Gilman, Guin, et al., ACM SIGMETRICS Performance Evaluation Review 2021
- Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks by Zhao, Han, et al., ICCD 2021
- Online Thread Auto-Tuning for Performance Improvement and Resource Saving by Luan, Guangqiang, et al., IEEE TPDS 2021
- HSM: A hybrid slowdown model for multitasking GPUs by Zhao, Xia, Magnus Jahre, and Lieven Eeckhout., ASPLOS 2020
- Enabling and exploiting flexible task assignment on GPU through SM-centric program transformations by Wu, Bo, et al., ACM ICS 2015
- Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming by Xu, Qiumin, et al., ISCA 2016
- Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling by Zhong, Jianlong, and Bingsheng He., IEEE TPDS 2013
- Improving GPGPU concurrency with elastic kernels by Pai, Sreepathi, Matthew J. Thazhuthaveetil, and Ramaswamy Govindarajan., ACM SIGARCH Computer Architecture News 2013
- Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs by Kayıran, Onur, et al., PACT 2013
- Orion: A framework for GPU occupancy tuning by Hayes, Ari B., et al., Middleware 2016
- Efficient performance estimation and work-group size pruning for OpenCL kernels on GPUs by Wang, Xiebing, et al., IEEE TPDS 2019
- Online evolutionary batch size orchestration for scheduling deep learning workloads in GPU clusters by Bian, Zhengda, et al., SC 2021
- Autotuning GPU kernels via static and predictive analysis by Lim, Robert, Boyana Norris, and Allen Malony., IEEE ICPP 2017
- GSLICE: Controlled spatial sharing of GPUs for a scalable inference platform by Dhakal, Aditya, Sameer G. Kulkarni, and K. K. Ramakrishnan., SoCC 2020
- Fractional GPUs: Software-based compute and memory bandwidth reservation for GPUs by Jain, Saksham, et al., RTAS 2019
- EffiSha: A software framework for enabling efficient preemptive scheduling of GPU by Chen, Guoyang, et al., PPoPP 2017
- Automatic thread-block size adjustment for memory-bound BLAS kernels on GPUs by Mukunoki, Daichi, Toshiyuki Imamura, and Daisuke Takahashi., MCSOC 2016
- FlexSched: Efficient scheduling techniques for concurrent kernel execution on GPUs by López-Albelda, Bernabé, et al., The Journal of Supercomputing 2022
- Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing by Wang, Zhenning, et al., HPCA 2016
- Optimum: Runtime Optimization for Multiple Mixed Model Deployment Deep Learning Inference by Guo, Kaicheng, et al., preprint 2022
- Exploring AMD GPU scheduling details by experimenting with “worst practices” by Otterness, Nathan, and James H. Anderson., RTNS 2021
- Making Powerful Enemies on NVIDIA GPUs by Yandrofski, Tyler, et al., RTSS 2022
- Contention-Aware GPU Partitioning and Task-to-Partition Allocation for Real-Time Workloads by Zahaf, Houssam-Eddine, et al., RTNS 2021
- Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent by Weng, Qizhen, et al., ATC 2023
- VectorVisor: A Binary Translation Scheme for Throughput-Oriented GPU Acceleration by Ginzburg, Samuel, Mohammad Shahrad, and Michael J. Freedman., ATC 2023
- Arbitor: A Numerically Accurate Hardware Emulation Tool for DNN Accelerators by Jiang, Chenhao, et al., ATC 2023
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge by Abbasi, Saad, Alexander Wong, and Mohammad Javad Shafiee., arxiv 2022
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices by Nair, Saeejith, et al., CVPR 2022
- MAPLE: Microprocessor A Priori for Latency Estimation by Abbasi, Saad, Alexander Wong, and Mohammad Javad Shafiee., CVPR 2022
- nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices by Zhang, Li Lyna, et al., MobiSys 2021
- Predicting and reining in application-level slowdown on spatial multitasking GPUs by Wei, Mengze, et al., JPDC 2020
- A model-based software solution for simultaneous multiple kernels on GPUs by Wu, Hao, et al., TACO 2020
- SMCompactor: A workload-aware fine-grained resource management framework for GPGPUs by Chen, Qichen, et al., SAC 2021
- Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training by Yu, Geoffrey X., et al., ATC 2021
- MCUNet: Tiny deep learning on IoT devices by Lin, Ji, et al., NeurIPS 2020
- TinyML: Current Progress, Research Challenges, and Future Roadmap by Shafique, Muhammad, et al., DAC 2021
- Benchmarking TinyML systems: Challenges and direction by Banbury, Colby R., et al., arxiv 2020
- μNAS: Constrained Neural Architecture Search for Microcontrollers by Liberis, Edgar, Łukasz Dudziak, and Nicholas D. Lane., EuroMLSys 2021
- Memory-efficient Patch-based Inference for Tiny Deep Learning by Lin, Ji, et al., NeurIPS 2021
- Deep Learning on Microcontrollers: A Study on Deployment Costs and Challenges by Svoboda, Filip, Javier Fernandez-Marques, Edgar Liberis, and Nicholas D. Lane., EuroMLSys 2022
- Space-Efficient TREC for Enabling Deep Learning on Microcontrollers by Liu, Jiesong, et al., ASPLOS 2023
- YONO: Modeling multiple heterogeneous neural networks on microcontrollers by Kwon, Young D., Jagmohan Chauhan, and Cecilia Mascolo., IPSN 2022
- Dynamic Multimodal Fusion by Xue, Zihui, and Radu Marculescu., arxiv 2022
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action by Shah, Dhruv, et al., arxiv 2022
- Accelerating mobile audio sensing algorithms through on-chip gpu offloading by Georgiev, Petko, et al., MobiSys 2017
- Enabling Edge Devices that Learn from Each Other: Cross Modal Training for Activity Recognition by Xing, Tianwei, et al., EdgeSys 2018
- SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute by Zheng, Ningxin, et al., OSDI 2022
- ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition by Li, Shiyu, et al., MICRO 2021
- A high-performance sparse tensor algebra compiler in Multi-Level IR by Tian, Ruiqin, et al., arxiv 2021
- Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction by Huang, Guyue, et al., arxiv 2021
- COEXE: An Efficient Co-execution Architecture for Real-Time Neural Network Services by Liu, Chubo, et al., DAC 2020
- TorchSparse: Efficient Point Cloud Inference Engine by Tang, Haotian, et al., MLSys 2022
- SecureTVM: A TVM-Based Compiler Framework for Selective Privacy-Preserving Neural Inference by Huang, Po-Hsuan, et al., TODAES 2023
- PolyMPCNet: Towards ReLU-free Neural Architecture Search in Two-party Computation Based Private Inference by Peng, Hongwu, et al., arxiv 2023
- Cheetah: Lean and Fast Secure Two-Party Deep Neural Network Inference by Huang, Zhicong, et al., USENIX Security 2022
- Exploring Collaborative Distributed Diffusion-Based AI-Generated Content (AIGC) in Wireless Networks by Du, Hongyang, et al., arxiv 2023
- Distributed inference with deep learning models across heterogeneous edge devices by Hu, Chenghao, and Baochun Li., INFOCOM 2022
- ARK: GPU-driven Code Execution for Distributed Deep Learning by Hwang, Changho, et al., NSDI 2023
- On Modular Learning of Distributed Systems for Predicting End-to-End Latency by Liang, Chieh-Jan Mike, et al., NSDI 2023
- Understanding and Optimizing Deep Learning Cold-Start Latency on Edge Devices by Yi, Rongjie, et al., arxiv 2022
- Towards efficient vision transformer inference: a first study of transformers on mobile devices by Wang, Xudong, et al., HotMobile 2022
- EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference by Tambe, Thierry, et al., MICRO 2021
- EDGEWISE: A Better Stream Processing Engine for the Edge by Fu, Xinwei, et al., ATC 2019
- LiteFlow: towards high-performance adaptive neural networks for kernel datapath by Zhang, Junxue, et al., SIGCOMM 2022
- CoCoPIE: Making Mobile AI Sweet As PIE with Compression-Compilation Co-Design by Liu, Shaoshan, et al., arxiv 2020
- Beyond Data and Model Parallelism for Deep Neural Networks by Jia, Zhihao, Matei Zaharia, and Alex Aiken, MLSys 2019
- Discovering faster matrix multiplication algorithms with reinforcement learning by Fawzi, Alhussein, et al., Nature 2022
- Gemel: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge by Padmanabhan, Arthi, et al., NSDI 2023
- RECL: Responsive Resource-Efficient Continuous Learning for Video Analytics by Khani, Mehrdad, et al., NSDI 2023
- Ekya: Continuous learning of video analytics models on edge compute servers by Bhardwaj, Romil, et al., NSDI 2022