# Paper List for Machine Learning Systems

Paper list for broad topics in machine learning systems.

NOTE: Survey papers are annotated with the [Survey π] prefix.

## Table of Contents
- Paper List for Machine Learning Systems
- 1. Data Processing
- 2. Training System
  - 2.1 DL scheduling
  - 2.2 GPU sharing
  - 2.3 GPU memory management and optimization
  - 2.4 GPU memory usage estimate
  - 2.5 Distributed training (Parallelism)
  - 2.6 DL job failures
  - 2.7 Model checkpointing
  - 2.8 AutoML
  - 2.9 Communication optimization
  - 2.10 Energy-efficient DNN training (carbon-aware)
  - 2.11 DNN compiler
  - 2.12 Model pruning and compression
  - 2.13 GNN training system
  - 2.14 Congestion control for DNN training
- 3. Inference System
- 4. Federated Learning
- 5. Privacy-Preserving ML
- 6. ML APIs & Application-side Optimization
- 7. ML for Systems
- Others
- References

## 1. Data Processing

- [arxiv'24] cedar: Composable and Optimized Machine Learning Input Data Pipelines
- [MLSys'22] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
- [ISCA'22] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
- [SIGMOD'22] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
- [VLDB'21] Analyzing and Mitigating Data Stalls in DNN Training
- [VLDB'21] tf.data: A Machine Learning Data Processing Framework
- [VLDB'24] FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
- [arxiv'23] Rinas: Training with Dataset Shuffling Can Be General and Fast
- [CVPR'23] FFCV: Accelerating Training by Removing Data Bottlenecks
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [SIGMOD'23] GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning
- [VLDB'23] FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
- [SoCC'23] tf.data service: A Case for Disaggregating ML Input Data Processing
- [ATC'22] Cachew: Machine Learning Input Data Processing as a Service
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters
- [ICPP'19] DLBooster: Boosting End-to-End Deep Learning Workflows with Offloading Data Preprocessing Pipelines
- [TACO'23] Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
- [ICPP'22] Lobster: Load Balance-Aware I/O for Distributed DNN Training
- [SC'21] Clairvoyant Prefetching for Distributed Machine Learning I/O
- [arxiv'23] Towards Data-centric Graph Machine Learning: Review and Outlook
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [MLSys'23] RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
- [ASPLOS'22] RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [arxiv'23] MTrainS: Improving DLRM training efficiency using heterogeneous memories
- [SOSP'23] Bagpipe: Accelerating Deep Recommendation Model Training
- [SOSP'23] gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning
- [NSDI'23] BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
- [DAC'22] A Joint Management Middleware to Improve Training Performance of Deep Recommendation Systems with SSDs
- [VLDB'22] Accelerating Recommendation System Training by Leveraging Popular Choices
- [TPDS'23] High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms
- [SOSP'23] UGACHE: A Unified GPU Cache for Embedding-based Deep Learning
- [ATC'23] Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 2.1]
- [FAST'23] SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
- [HPCA'23] iCACHE: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training
- [NeurIPS'22] A Deep Learning Dataloader with Shared Data Preparation
- [CLUSTER'22] Hvac: Removing I/O Bottleneck for Large-Scale Deep Learning Applications
- [ICDE'22] Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
- [ATC'21] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training
- [FAST'20] Quiver: An Informed Storage Cache for Deep Learning
- [ICPP'20] DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
- [arXiv'19] Faster Neural Network Training with Data Echoing
- [HotCloud'19] The Case for Unifying Data Loading in Machine Learning Clusters
- [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
- [VLDB'21] Progressive compressed records: Taking a byte out of deep learning data
- [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines
- [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision

## 2. Training System

### 2.1 DL scheduling

- [EuroSys'24] Blox: A Modular Toolkit for Deep Learning Schedulers
- [NSDI'24] Swing: Short-cutting Rings for Higher Bandwidth Allreduce
- [NSDI'24] Towards Domain-Specific Network Transport for Distributed DNN Training
- [NSDI'24] Vulcan: Automatic Query Planning for Live ML Analytics
- [NSDI'24] CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
- [Survey π] [ACM CSUR'23] Deep Learning Workload Scheduling in GPU Datacenters: A Survey
- [arxiv'23] Energy-Efficient GPU Clusters Scheduling for Deep Learning
- [SC'23] EasyScale: Accuracy-consistent Elastic Training for Deep Learning
- [ICPP'23] CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel
- [ICPP'23] Embracing Uncertainty for Equity in Resource Allocation in ML Training
- [SOSP'23] Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
- [NSDI'23] Shockwave: Proactive, Fair, and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 1.2]
- [EuroSys'23] Lyra: Elastic Scheduling for Deep Learning Clusters
- [EuroSys'23] ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
- [ASPLOS'23] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
- [arxiv'22] Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads
- [Survey π] [arxiv'22] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
- [SoCC'22] ESCHER: Expressive Scheduling with Ephemeral Resources
- [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (Synergy)
- [SIGCOMM'22] Multi-resource interleaving for deep learning training (Muri)
- [MLSys'21] Wavelet: Efficient DNN Training with Tick-Tock Scheduling
- [SoCC'21] Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
- [SC'21] Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (Helios)
- [OSDI'21] Privacy Budget Scheduling (DPF)
- [NSDI'21] Elastic Resource Sharing for Distributed Deep Learning (AFS)
- [OSDI'21] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
- [EuroSys'20] Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning (GandivaFair)
- [NSDI'20] Themis: Fair and Efficient GPU Cluster Scheduling
- [OSDI'20] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
- [OSDI'20] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (Gavel)
- [EuroSys'20] AlloX: Compute Allocation in Hybrid Clusters
- [MLSys'20] Resource Elasticity in Distributed Deep Learning
- [NSDI'19] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
- [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
- [EuroSys'18] Optimus: an efficient dynamic resource scheduler for deep learning clusters
- [OSDI'18] Gandiva: Introspective Cluster Scheduling for Deep Learning

### 2.2 GPU sharing

- [EuroSys'24 (to appear)] Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
- [ATC'23] Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent
- [NSDI'23] Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
- [ICPP'23] FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference
- [arxiv'23] MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
- [SoCC'22] MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
- [PACT'22] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
- [ATC'21] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
- [MLSys'20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
- [OSDI'20] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
- [OSDI'20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications

### 2.3 GPU memory management and optimization

- [ASPLOS'24] GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Quantized Distributed Training of Large Models with Convergence Guarantees (QSDP)
- [arxiv'23] Does compressing activations help model parallel training?
- [SoCC'23] Towards GPU Memory Efficiency for Distributed Training at Scale
- [VLDB'23] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [HPCA'23] MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism
- [HPCA'23] Tensor Movement Orchestration in Multi-GPU Training Systems
- [IJCAI'23] OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
- [ICLR'22] LoRA: Low-Rank Adaptation of Large Language Models
  - algorithmic method for memory efficiency; see the minimal adapter sketch at the end of this subsection
- [VLDB'22] Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
- [ATC'21] ZeRO-Offload: Democratizing Billion-Scale Model Training
- [ICLR'21] ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
- [ICLR'21] Dynamic Tensor Rematerialization
- [SC'21] ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning
- [HPCA'21] Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning
- [MLSys'20] Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
- [ASPLOS'20] Capuchin: Tensor-based GPU Memory Management for Deep Learning
- [ASPLOS'20] SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
- [SC'20] ZeRO: memory optimizations toward training trillion parameter models
- [ISCA'18] Gist: Efficient Data Encoding for Deep Neural Network Training
- [PPoPP'18] Superneurons: dynamic GPU memory management for training deep neural networks
- [MICRO'16] vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
- [arxiv'16] Training Deep Nets with Sublinear Memory Cost
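
A note on the LoRA entry above: it saves memory algorithmically rather than at the systems level, by freezing the pretrained weight and training only a low-rank update, so gradients and optimizer state are kept only for the small adapter matrices. A minimal sketch of the idea, assuming PyTorch; the shapes and hyperparameters below are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style linear layer: y = x W^T + (alpha/r) x A^T B^T, with W frozen."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                           # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # trainable, small init
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))         # trainable, zero init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * ((x @ self.lora_A.T) @ self.lora_B.T)

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable: {trainable:,} params vs. frozen: {layer.base.weight.numel():,}")
```

Because only `lora_A` and `lora_B` receive gradients and optimizer state, fine-tuning memory grows with the rank `r` instead of with the full weight matrix.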

### 2.4 GPU memory usage estimate

- [ESEC/FSE'20] Estimating GPU memory consumption of deep learning models
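
For quick sanity checks before profiling, a back-of-the-envelope estimate of the model-state portion of training memory is often sufficient. The sketch below assumes the mixed-precision Adam accounting described in the ZeRO paper (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter); activation memory is workload-dependent and deliberately excluded, and the byte counts change with other optimizers or precisions:

```python
def model_state_bytes(num_params: int,
                      bytes_per_param: int = 2,        # fp16/bf16 weights
                      bytes_per_grad: int = 2,         # fp16/bf16 gradients
                      bytes_per_optim_state: int = 12  # fp32 master weights + Adam m and v
                      ) -> int:
    """Rough model-state memory for mixed-precision Adam training (activations excluded)."""
    return num_params * (bytes_per_param + bytes_per_grad + bytes_per_optim_state)

# Example: a 7B-parameter model needs ~112 GB of model state per replica
# before any sharding (ZeRO/FSDP) or offloading is applied.
print(model_state_bytes(7_000_000_000) / 1e9, "GB")
```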

### 2.5 Distributed training (Parallelism)

- [ICLR'24] Zero Bubble (Almost) Pipeline Parallelism
- [arxiv'24] BitDelta: Your Fine-Tune May Only Be Worth One Bit
- [arxiv'24] NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
- [arxiv'24] Accelerating Parallel Sampling of Diffusion Models
- [arxiv'24] Training DNN Models over Heterogeneous Clusters with Optimal Performance
- [NSDI'24] DISTMM: Accelerating Distributed Multi-modal Model Training
- [NSDI'24] Accelerating Neural Recommendation Training with Embedding Scheduling
- [NSDI'24] Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer
- [NSDI'24] QuickUpdate: a Real-Time Personalization System for Large-Scale Recommendation Models
- [NSDI'24] Scaling Large Language Model Training to More Than 10,000 GPUs
- [arxiv'24] Breaking MLPerf Training: A Case Study on Optimizing BERT
- [ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [arxiv'24] LocMoE: A Low-overhead MoE for Large Language Model Training
- [arxiv'24] Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
- [AAMAS'24] Holonic Learning: A Flexible Agent-based Distributed Machine Learning Framework
- [arxiv'24] InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding
- [VLDB'24] Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
- [HPCA'24] Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search
- [NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
- [EuroSys'24] HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
- [ICPP'23] Mercury: Fast and Optimal Device Placement for Large Deep Learning Models
- [arxiv'23] TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections
- [arxiv'23] ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- [arxiv'23] FlexModel: A Framework for Interpretability of Distributed Large Language Models
- [arxiv'23] Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
- [arxiv'23] RTP: Rethinking Tensor Parallelism with Memory Deduplication
- [arxiv'23] FP8-LM: Training FP8 Large Language Models
- [arxiv'23] Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
- [arxiv'23] FLM-101B: An Open LLM and How to Train It with $100K Budget
- [arxiv'23] UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
- [arxiv'23] Improving Automatic Parallel Training via Balanced Memory Workload Optimization
  - extended version of Galvatron (VLDB'23)
- [arxiv'23] Modeling Parallel Programs using Large Language Models
- [arxiv'23] Proteus: Simulating the Performance of Distributed DNN Training
- [arxiv'23] Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
- [arxiv'23] Decoupled Model Schedule for Deep Learning Training
- [arxiv'23] RAF: Holistic Compilation for Deep Learning Model Training
- [arxiv'23] Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches
- [arxiv'23] Does compressing activations help model parallel training?
- [arxiv'23] Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
- [arxiv'23] Scaling Vision Transformers to 22 Billion Parameters
- [arxiv'23] Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform
- [arxiv'23] TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation
- [arxiv'23] SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction
- [arxiv'23] ATP: Adaptive Tensor Parallelism for Foundation Models
- [arxiv'23] AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication
- [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [CLUSTER'23] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
- [NeurIPS'23] DeepPCR: Parallelizing Sequential Operations in Neural Networks
- [DAC'23] MixPipe: Efficient Bidirectional Pipeline Parallelism for Training Large-Scale Models
- [SC'23] Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency
- [SOSP'23] PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
- [SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
- [HPCA'23] Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism
- [ACL'23] Sequence Parallelism: Long Sequence Training from System Perspective
- [CCGrid'23] A Deep Learning Pipeline Parallel Optimization Method
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [ATC'23] Accelerating Distributed MoE Training and Inference with Lina
- [ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
- [ATC'23] MSRL: Distributed Reinforcement Learning with Dataflow Fragments
- [Survey π] [TPDS'23] A Survey on Auto-Parallelism of Large-Scale Deep Learning Training
- [ICML'23] SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
- [ICML'23] BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
- [ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- [NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- [NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- [NSDI'23] ARK: GPU-driven Code Execution for Distributed Deep Learning
- [SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
- [TPDS'23] Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
- [PPoPP'23] Elastic Averaging for Efficient Pipelined DNN Training
- [PPoPP'23] Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems
- [VLDB'23] MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
- [VLDB'23] Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
- [ASPLOS'23] Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers
- [ASPLOS'23] Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression
- [arxiv'22] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- [arxiv'22] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- [arxiv'22] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- [ICPP'22] Tesseract: Parallelize the Tensor Parallelism Efficiently
- [MLSys'22] Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning
- [NeurIPS'22] Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees
- [SoCC'22] Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
- [MLSys'22] Pathways: Asynchronous distributed dataflow for ML
- [MLSys'22] SRIFTY: Swift and Thrifty Distributed Neural Network Training on the Cloud
- [MLSys'22] Efficient Strong Scaling Through Burst Parallel Training
- [EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
- [ATC'22] Whale: Efficient Giant Model Training over Heterogeneous GPUs
- [NeurIPS'22] AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness
- [PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- [ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- [HPDC'22] Hare: Exploiting Inter-job and Intra-job Parallelism of Distributed Machine Learning on Heterogeneous GPUs
- [OSDI'22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- [NSDI'22] Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks
- [arxiv'21] Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
- [arxiv'21] GSPMD: General and Scalable Parallelization for ML Computation Graphs
- [JMLR'21] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- [TPDS'21] TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism
- [ATC'21] Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.9]
- [MLSys'21] PipeMare: Asynchronous Pipeline Parallel DNN Training
- [ICLR'21] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- [NeurIPS'21] Piper: Multidimensional Planner for DNN Parallelization
- [ICML'21] Memory-Efficient Pipeline-Parallel DNN Training
- [ICML'21] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
- [ICML'21] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- [SC'21] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- [SC'21] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (PTD-P or Megatron-LM v2)
- [FAST'21] Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
- [PPoPP'21] DAPPLE: a pipelined data parallel approach for training large models
- [VLDB'21] Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
- [HPCA'20] AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
- [NeurIPS'20] Efficient Algorithms for Device Placement of DNN Graph Operators
- [arxiv'20] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- [KDD'20 Tutorial] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
- [VLDB'20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
- [SOSP'19] PipeDream: Generalized Pipeline Parallelism for DNN Training
- [NeurIPS'20] Language Models are Few-Shot Learners [From OpenAI]
- [arxiv'20] Scaling Laws for Neural Language Models [From OpenAI]
- [HPCA'19] HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
- [IEEE MICRO'19] Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
- [MLSys'19] Beyond data and model parallelism for deep neural networks (FlexFlow)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [EuroSys'19] Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks
- [EuroSys'19] Supporting Very Large Models using Automatic Dataflow Graph Partitioning (Tofu)
- [SOSP'19] A Generic Communication Scheduler for Distributed DNN Training Acceleration
- [NeurIPS'19] Mesh-TensorFlow: Deep Learning for Supercomputers
- [NeurIPS'19] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- [ICML'18] Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
- [Survey π] [IJCAI'22] Survey on Efficient Training of Large Neural Networks
- [Survey π] [ACM CSUR'19] Demystifying Parallel and Distributed Deep Learning
- [Survey π] [ACM CSUR'19] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools

### 2.6 DL job failures

- [ATC'22] Sibylla: To Retry or Not To Retry on Deep Learning Job Failure
- [ICSE'20] An Empirical Study on Program Failures of Deep Learning Jobs

### 2.7 Model checkpointing

- [FAST'21] CheckFreq: Frequent, Fine-Grained DNN Checkpointing

### 2.8 AutoML

- [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
- [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
- [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework

### 2.9 Communication optimization

- [arxiv'24] MLTCP: Congestion Control for DNN Training
- [arxiv'24] Accelerating Distributed Deep Learning using Lossless Homomorphic Compression
- [NSDI'24] THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
- [arxiv'23] Optimized Network Architectures for Large Language Model Training with Billions of Parameters
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Zen: Near-Optimal Sparse Tensor Synchronization for Distributed DNN Training
- [arxiv'23] ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
- [arxiv'23] TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training
- [ICML'23] CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
- Related to DT-FM (NeurIPS'22)
- [IPDPS'23] MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
- [ASPLOS'23] MSCCLang: Microsoft Collective Communication Language
- [ASPLOS'23] Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
- [EuroSys'23] A2TP: Aggregator-aware In-network Aggregation for Multi-tenant Learning
- [EuroSys'23] Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies
- [MLSys'23] Cupcake: A Compression Optimizer for Scalable Communication-Efficient Distributed Training
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [NSDI'23] Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
- [NSDI'23] TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
- [ISCA'22] Themis: a network bandwidth-aware collective scheduling policy for distributed training of DL models
- [SC'22] HammingMesh: A Network Topology for Large-Scale Deep Learning
- [PPoPP'22] Near-optimal sparse allreduce for distributed deep learning
- [MLSys'22] Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning (P^2)
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.5]
- [SC'21] Flare: flexible in-network allreduce
- [NSDI'21] Scaling Distributed Machine Learning with In-Network Aggregation
- [ISCA'21] Enabling compute-communication overlap in distributed deep learning training platforms
- [PPoPP'21] Synthesizing optimal collective algorithms (SCCL)
- [PPoPP'20] Taming unbalanced training workloads in deep learning with partial collective operations
- [MLSys'20] Blink: Fast and Generic Collectives for Distributed ML
- [MLSys'20] PLink: Discovering and Exploiting Datacenter Network Locality for Efficient Cloud-based Distributed Training
- [MLSys'19] Priority-based Parameter Propagation for Distributed DNN Training (P3)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [SOSP'19] A generic communication scheduler for distributed DNN training acceleration (ByteScheduler)
- [ATC'17] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
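
Many of the collective-communication papers above are designed and evaluated against the standard ring all-reduce baseline, whose cost is usually written in the alpha-beta model (per-step latency alpha, per-byte time beta). A minimal sketch of that cost model; the numbers in the example are assumptions, not measurements from any of the listed systems:

```python
def ring_allreduce_time(n_bytes: float, p: int, alpha: float, beta: float) -> float:
    """Alpha-beta estimate for ring all-reduce over p ranks.

    Reduce-scatter plus all-gather take 2*(p-1) steps, each moving n_bytes/p,
    so per-rank traffic is 2*(p-1)/p * n_bytes (bandwidth-optimal for large messages).
    """
    steps = 2 * (p - 1)
    return steps * alpha + steps * (n_bytes / p) * beta

# Example: 1 GB of gradients, 64 ranks, 5 us per-step latency, 100 GB/s links
t = ring_allreduce_time(1e9, 64, alpha=5e-6, beta=1 / 100e9)
print(f"~{t * 1e3:.1f} ms per all-reduce")
```

Scheduling, synthesis, compression, and in-network aggregation approaches in this subsection can be read as attacking either the latency term (fewer or overlapped steps) or the bandwidth term (less per-rank traffic) of this baseline.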

### 2.10 Energy-efficient DNN training (carbon-aware)

- [arxiv'23] Perseus: Removing Energy Bloat from Large Model Training
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [ATC'23] EnvPipe: Performance-preserving DNN Training Framework for Saving Energy
- [NSDI'23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

### 2.11 DNN compiler

- [OSDI'23] Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning
- [OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
- [OSDI'23] Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
- [OSDI'23] EINNET: Optimizing Tensor Programs with Derivation-Based Transformations
- [OSDI'22] ROLLER: Fast and Efficient Tensor Compilation for Deep Learning
- [OSDI'20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
- [OSDI'20] Ansor: Generating High-Performance Tensor Programs for Deep Learning
- [ASPLOS'20] FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
- [OSDI'18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

### 2.12 Model pruning and compression

- [ACL'23] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
- [ICLR'23] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- [OSDI'23] AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models
- [ICML'22] TSPipe: Learn from Teacher Faster with Pipelines

### 2.13 GNN training system

For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.
- [VLDB'24] NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams
- [arxiv'23] ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training
- [arxiv'23] Helios: An Efficient Out-of-core GNN Training System on Terabyte-scale Graphs with In-memory Performance
- [arxiv'23] GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism
- [MLSys'23] Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
- [SIGMOD'23] DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [EuroSys'23] MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks
- [KDD'22] Distributed Hybrid CPU and GPU training for Graph Neural Networks on Billion-Scale Heterogeneous Graphs
- [VLDB'22] TGL: a general framework for temporal GNN training on billion-scale graphs
- [OSDI'21] P3: Distributed Deep Graph Learning at Scale

### 2.14 Congestion control for DNN training

- [arxiv'24] MLTCP: Congestion Control for DNN Training
- [HotNets'22] Congestion Control in Machine Learning Clusters

## 3. Inference System

- [arxiv'24] Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
- [arxiv'24] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
- [PPoPP'24 poster] POSTER: LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
- [NSDI'24] Approximate Caching for Efficiently Serving Diffusion Models
- [NSDI'24] Characterization of Large Language Model Development in the Datacenter
- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [arxiv'24] ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
- [arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
- [arxiv'24] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- [arxiv'24] Accelerating Retrieval-Augmented Language Model Serving with Speculation
- [arxiv'24] CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
- [arxiv'24] Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
- [arxiv'24] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- [arxiv'24] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- [Survey π] [arxiv'24] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
- [arxiv'24] Learned Best-Effort LLM Serving
- [arxiv'24] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- [VLDB'24] Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
- [ASPLOS'24] SpotServe: Serving Generative Large Language Models on Preemptible Instances
- [arxiv'23] Splitwise: Efficient generative LLM inference using phase splitting
- [arxiv'23] SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification
- [arxiv'23] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
- [arxiv'23] Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- [arxiv'23] Fairness in Serving Large Language Models
- [arxiv'23] HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment
- [arxiv'23] Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices
- [arxiv'23] Punica: Multi-Tenant LoRA Serving
- [arxiv'23] Pipeline Parallelism for DNN Inference with Practical Performance Guarantees
- [arxiv'23] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- [arxiv'23] High-throughput Generative Inference of Large Language Models with a Single GPU
- [HPDC'23] Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
- [SOSP'23] Paella: Low-latency Model Serving with Virtualized GPU Scheduling
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [MLSys'23] Efficiently Scaling Transformer Inference
- [EuroSys'23] Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access
- [EuroSys'23] Tabi: An Efficient Multi-Level Inference System for Large Language Models
- [EuroSys'23] Pocket: ML Serving from the Edge
- [OSDI'23] AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- [NSDI'23] SHEPHERD: Serving DNNs in the Wild
- [VLDB'23] Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
- [ICML'23] Fast Inference from Transformers via Speculative Decoding
- [SIGMOD'22] Serverless Data Science - Are We There Yet? A Case Study of Model Serving
- [OSDI'22] Orca: A Distributed Serving System for Transformer-Based Generative Models
- [OSDI'22] Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences
- [ATC'22] SOTER: Guarding Black-box Inference for General Neural Networks at the Edge
- [ATC'22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
- [ATC'22] Tetris: Memory-efficient Serverless Inference through Tensor Sharing
- [ATC'22] PetS: A Unified Framework for Parameter-Efficient Transformers Serving
- [ATC'21] INFaaS: Automated Model-less Inference Serving
- [SoCC'21] Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving
- [arxiv'21] Supporting Massive DLRM Inference through Software Defined Memory
- [MobiCom'20] SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud

## 4. Federated Learning

- [SAC'24] Training Heterogeneous Client Models using Knowledge Distillation in Serverless Federated Learning
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [arxiv'23] Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization
- [IMWUT'23] AttFL: A Personalized Federated Learning Framework for Time-series Mobile and Embedded Sensor Data Processing
- [Survey π] [FGCS'23] Model aggregation techniques in federated learning: A comprehensive survey
- [SoCC'23] Auxo: Heterogeneity-Mitigating Federated Learning via Scalable Client Clustering
- [MLSys'23] GlueFL: Reconciling Client Sampling and Model Masking for Bandwidth Efficient Federated Learning
- [WWW'23] To Store or Not? Online Data Selection for Federated Learning with Limited Storage
- [EuroSys'23] REFL: Resource-Efficient Federated Learning
- [VLDB'23] FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
- [RecSys'22] Towards Fair Federated Recommendation Learning: Characterizing the Inter-Dependence of System and Data Heterogeneity
- [TMLR'22] Optimal Client Sampling for Federated Learning
- [ICML'22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale
- [MobiSys'22] FedBalancer: data and pace control for efficient federated learning on heterogeneous clients
- [MobiCom'22] PyramidFL: A Fine-grained Client Selection Framework for Efficient Federated Learning
- [MLSys'22] PAPAYA: Practical, Private, and Scalable Federated Learning
- [AISTATS'22] Federated Learning with Buffered Asynchronous Aggregation
- [NeurIPS'21] Federated Reconstruction: Partially Local Federated Learning
- [NeurIPS'21] FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout
- [OSDI'21] Oort: Efficient Federated Learning via Guided Participant Selection
- [MICRO'21] AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning
- [MLSys'19] Towards Federated Learning at Scale: System Design
- [Survey π] [ACM CSUR'22] Federated Learning for Smart Healthcare: A Survey

## 5. Privacy-Preserving ML

- [DAC'23] Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators
- [ICLR'23] MPCFormer: fast, performant and private Transformer inference with MPC
- [NeurIPS'22] Iron: Private Inference on Transformers

## 6. ML APIs & Application-side Optimization

- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [OSDI'24 (to appear)] Automatic and Efficient Customization of Neural Networks for ML Applications
- [ICML'22] Efficient Online ML API Selection for Multi-Label Classification Tasks (FrugalMCT)
- [NeurIPS'20] FrugalML: How to use ML Prediction APIs more accurately and cheaply

## 7. ML for Systems

- [arxiv'24] Large Language Model Adaptation for Networking
- [arxiv'24] LLM-Enhanced Data Management
- [arxiv'24] MPIrigen: MPI Code Generation through Domain-Specific Language Models
- [arxiv'24] Can Large Language Models Write Parallel Code?
- [arxiv'23] LLM-Assisted Code Cleaning For Training Accurate Code Generators
- [arxiv'23] Large Language Models for Compiler Optimization
- [VLDB'23] How Large Language Models Will Disrupt Data Management

## Others

- [arxiv'24] You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- [arxiv'24] Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native
- [Survey π] [arxiv'24] A Survey of Resource-efficient LLM and Multimodal Foundation Models
- [arxiv'23] Efficiently Programming Large Language Models using SGLang
- [MICRO'23] Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads

## References

This repository is motivated by:
- https://github.com/HuaizhengZhang/Awesome-System-for-Machine-Learning
- https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
- https://github.com/ganler/ResearchReading
- https://jeongseob.github.io/readings_mlsys.html
- https://github.com/chwan1016/awesome-gnn-systems
- https://github.com/ConnollyLeon/awesome-Auto-Parallelism