# Paper List for Machine Learning Systems

Paper list for broad topics in machine learning systems.

NOTE: Survey papers are annotated with the [Survey π] prefix.

## Table of Contents
- Paper List for Machine Learning Systems
- 1. Data Processing
- 2. Training System
  - 2.1 DL scheduling
  - 2.2 GPU sharing
  - 2.3 GPU memory management and optimization
  - 2.4 GPU memory usage estimate
  - 2.5 Distributed training (Parallelism)
  - 2.6 DL job failures
  - 2.7 Model checkpointing
  - 2.8 AutoML
  - 2.9 Communication optimization
  - 2.10 Energy-efficient DNN training (carbon-aware)
  - 2.11 DNN compiler
  - 2.12 Model pruning and compression
  - 2.13 GNN training system
  - 2.14 Congestion control for DNN training
- 3. Inference System
- 4. Federated Learning
- 5. Privacy-Preserving ML
- 6. ML APIs & Application-side Optimization
- 7. ML for Systems
- Others
- References

## 1. Data Processing

- [arxiv'24] cedar: Composable and Optimized Machine Learning Input Data Pipelines
- [MLSys'22] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
- [ISCA'22] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
- [SIGMOD'22] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
- [VLDB'21] Analyzing and Mitigating Data Stalls in DNN Training
- [VLDB'21] tf.data: A Machine Learning Data Processing Framework
- [VLDB'24] FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
- [arxiv'23] Rinas: Training with Dataset Shuffling Can Be General and Fast
- [CVPR'23] FFCV: Accelerating Training by Removing Data Bottlenecks
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [SIGMOD'23] GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning
- [VLDB'23] FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
- [SoCC'23] tf.data service: A Case for Disaggregating ML Input Data Processing
- [ATC'22] Cachew: Machine Learning Input Data Processing as a Service
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters
- [ICPP'19] DLBooster: Boosting End-to-End Deep Learning Workflows with Offloading Data Preprocessing Pipelines
- [TACO'23] Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
- [ICPP'22] Lobster: Load Balance-Aware I/O for Distributed DNN Training
- [SC'21] Clairvoyant Prefetching for Distributed Machine Learning I/O
- [arxiv'23] Towards Data-centric Graph Machine Learning: Review and Outlook
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [MLSys'23] RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
- [ASPLOS'22] RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [arxiv'23] MTrainS: Improving DLRM training efficiency using heterogeneous memories
- [SOSP'23] Bagpipe: Accelerating Deep Recommendation Model Training
- [SOSP'23] gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning
- [NSDI'23] BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
- [DAC'22] A Joint Management Middleware to Improve Training Performance of Deep Recommendation Systems with SSDs
- [VLDB'22] Accelerating Recommendation System Training by Leveraging Popular Choices
- [TPDS'23] High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms
- [SOSP'23] UGACHE: A Unified GPU Cache for Embedding-based Deep Learning
- [ATC'23] Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 2.1]
- [FAST'23] SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
- [HPCA'23] iCACHE: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training
- [NeurIPS'22] A Deep Learning Dataloader with Shared Data Preparation
- [CLUSTER'22] Hvac: Removing I/O Bottleneck for Large-Scale Deep Learning Applications
- [ICDE'22] Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
- [ATC'21] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training
- [FAST'20] Quiver: An Informed Storage Cache for Deep Learning
- [ICPP'20] DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
- [arXiv'19] Faster Neural Network Training with Data Echoing
- [HotCloud'19] The Case for Unifying Data Loading in Machine Learning Clusters
- [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
- [VLDB'21] Progressive compressed records: Taking a byte out of deep learning data
- [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines
- [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision

## 2. Training System

### 2.1 DL scheduling

- [EuroSys'24] Blox: A Modular Toolkit for Deep Learning Schedulers
- [NSDI'24] Swing: Short-cutting Rings for Higher Bandwidth Allreduce
- [NSDI'24] Towards Domain-Specific Network Transport for Distributed DNN Training
- [NSDI'24] Vulcan: Automatic Query Planning for Live ML Analytics
- [NSDI'24] CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
- [Survey π] [ACM CSUR'23] Deep Learning Workload Scheduling in GPU Datacenters: A Survey
- [arxiv'23] Energy-Efficient GPU Clusters Scheduling for Deep Learning
- [SC'23] EasyScale: Accuracy-consistent Elastic Training for Deep Learning
- [ICPP'23] CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel
- [ICPP'23] Embracing Uncertainty for Equity in Resource Allocation in ML Training
- [SOSP'23] Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
- [NSDI'23] Shockwave: Proactive, Fair, and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 1.2]
- [EuroSys'23] Lyra: Elastic Scheduling for Deep Learning Clusters
- [EuroSys'23] ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
- [ASPLOS'23] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
- [arxiv'22] Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads
- [Survey π] [arxiv'22] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
- [SoCC'22] ESCHER: Expressive Scheduling with Ephemeral Resources
- [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (Synergy)
- [SIGCOMM'22] Multi-resource interleaving for deep learning training (Muri)
- [MLSys'21] Wavelet: Efficient DNN Training with Tick-Tock Scheduling
- [SoCC'21] Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
- [SC'21] Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (Helios)
- [OSDI'21] Privacy Budget Scheduling (DPF)
- [NSDI'21] Elastic Resource Sharing for Distributed Deep Learning (AFS)
- [OSDI'21] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
- [EuroSys'20] Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning (GandivaFair)
- [NSDI'20] Themis: Fair and Efficient GPU Cluster Scheduling
- [OSDI'20] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
- [OSDI'20] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (Gavel)
- [EuroSys'20] AlloX: Compute Allocation in Hybrid Clusters
- [MLSys'20] Resource Elasticity in Distributed Deep Learning
- [NSDI'19] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
- [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
- [EuroSys'18] Optimus: an efficient dynamic resource scheduler for deep learning clusters
- [OSDI'18] Gandiva: Introspective Cluster Scheduling for Deep Learning

### 2.2 GPU sharing

- [EuroSys'24 (to appear)] Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
- [ATC'23] Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent
- [NSDI'23] Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
- [ICPP'23] FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference
- [arxiv'23] MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
- [SoCC'22] MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
- [PACT'22] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
- [ATC'21] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
- [MLSys'20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
- [OSDI'20] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
- [OSDI'20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications

### 2.3 GPU memory management and optimization

- [ASPLOS'24] GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Quantized Distributed Training of Large Models with Convergence Guarantees (QSDP)
- [arxiv'23] Does compressing activations help model parallel training?
- [SoCC'23] Towards GPU Memory Efficiency for Distributed Training at Scale
- [VLDB'23] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [HPCA'23] MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism
- [HPCA'23] Tensor Movement Orchestration in Multi-GPU Training Systems
- [IJCAI'23] OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
- [ICLR'22] LoRA: Low-Rank Adaptation of Large Language Models
  - algorithmic method for memory efficiency; see the minimal adapter sketch at the end of this subsection
- [VLDB'22] Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
- [ATC'21] ZeRO-Offload: Democratizing Billion-Scale Model Training
- [ICLR'21] ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
- [ICLR'21] Dynamic Tensor Rematerialization
- [SC'21] ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning
- [HPCA'21] Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning
- [MLSys'20] Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
- [ASPLOS'20] Capuchin: Tensor-based GPU Memory Management for Deep Learning
- [ASPLOS'20] SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
- [SC'20] ZeRO: memory optimizations toward training trillion parameter models
- [ISCA'18] Gist: Efficient Data Encoding for Deep Neural Network Training
- [PPoPP'18] Superneurons: dynamic GPU memory management for training deep neural networks
- [MICRO'16] vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
- [arxiv'16] Training Deep Nets with Sublinear Memory Cost
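
A note on the LoRA entry above: it saves memory algorithmically rather than at the systems level, by freezing the pretrained weight and training only a low-rank update, so gradients and optimizer state are kept only for the small adapter matrices. A minimal sketch of the idea, assuming PyTorch; the shapes and hyperparameters below are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style linear layer: y = x W^T + (alpha/r) x A^T B^T, with W frozen."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                           # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # trainable, small init
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))         # trainable, zero init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * ((x @ self.lora_A.T) @ self.lora_B.T)

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable: {trainable:,} params vs. frozen: {layer.base.weight.numel():,}")
```

Because only `lora_A` and `lora_B` receive gradients and optimizer state, fine-tuning memory grows with the rank `r` instead of with the full weight matrix.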

### 2.4 GPU memory usage estimate

- [ESEC/FSE'20] Estimating GPU memory consumption of deep learning models
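
For quick sanity checks before profiling, a back-of-the-envelope estimate of the model-state portion of training memory is often sufficient. The sketch below assumes the mixed-precision Adam accounting described in the ZeRO paper (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter); activation memory is workload-dependent and deliberately excluded, and the byte counts change with other optimizers or precisions:

```python
def model_state_bytes(num_params: int,
                      bytes_per_param: int = 2,        # fp16/bf16 weights
                      bytes_per_grad: int = 2,         # fp16/bf16 gradients
                      bytes_per_optim_state: int = 12  # fp32 master weights + Adam m and v
                      ) -> int:
    """Rough model-state memory for mixed-precision Adam training (activations excluded)."""
    return num_params * (bytes_per_param + bytes_per_grad + bytes_per_optim_state)

# Example: a 7B-parameter model needs ~112 GB of model state per replica
# before any sharding (ZeRO/FSDP) or offloading is applied.
print(model_state_bytes(7_000_000_000) / 1e9, "GB")
```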

### 2.5 Distributed training (Parallelism)

- [ICLR'24] Zero Bubble (Almost) Pipeline Parallelism
- [arxiv'24] BitDelta: Your Fine-Tune May Only Be Worth One Bit
- [arxiv'24] NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
- [arxiv'24] Accelerating Parallel Sampling of Diffusion Models
- [arxiv'24] Training DNN Models over Heterogeneous Clusters with Optimal Performance
- [NSDI'24] DISTMM: Accelerating Distributed Multi-modal Model Training
- [NSDI'24] Accelerating Neural Recommendation Training with Embedding Scheduling
- [NSDI'24] Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer
- [NSDI'24] QuickUpdate: a Real-Time Personalization System for Large-Scale Recommendation Models
- [NSDI'24] Scaling Large Language Model Training to More Than 10,000 GPUs
- [arxiv'24] Breaking MLPerf Training: A Case Study on Optimizing BERT
- [ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [arxiv'24] LocMoE: A Low-overhead MoE for Large Language Model Training
- [arxiv'24] Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
- [AAMAS'24] Holonic Learning: A Flexible Agent-based Distributed Machine Learning Framework
- [arxiv'24] InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding
- [VLDB'24] Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
- [HPCA'24] Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search
- [NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
- [EuroSys'24] HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
- [ICPP'23] Mercury: Fast and Optimal Device Placement for Large Deep Learning Models
- [arxiv'23] TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections
- [arxiv'23] ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- [arxiv'23] FlexModel: A Framework for Interpretability of Distributed Large Language Models
- [arxiv'23] Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
- [arxiv'23] RTP: Rethinking Tensor Parallelism with Memory Deduplication
- [arxiv'23] FP8-LM: Training FP8 Large Language Models
- [arxiv'23] Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
- [arxiv'23] FLM-101B: An Open LLM and How to Train It with $100K Budget
- [arxiv'23] UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
- [arxiv'23] Improving Automatic Parallel Training via Balanced Memory Workload Optimization
  - extended version of Galvatron (VLDB'23)
- [arxiv'23] Modeling Parallel Programs using Large Language Models
- [arxiv'23] Proteus: Simulating the Performance of Distributed DNN Training
- [arxiv'23] Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
- [arxiv'23] Decoupled Model Schedule for Deep Learning Training
- [arxiv'23] RAF: Holistic Compilation for Deep Learning Model Training
- [arxiv'23] Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches
- [arxiv'23] Does compressing activations help model parallel training?
- [arxiv'23] Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
- [arxiv'23] Scaling Vision Transformers to 22 Billion Parameters
- [arxiv'23] Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform
- [arxiv'23] TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation
- [arxiv'23] SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction
- [arxiv'23] ATP: Adaptive Tensor Parallelism for Foundation Models
- [arxiv'23] AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication
- [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [CLUSTER'23] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
- [NeurIPS'23] DeepPCR: Parallelizing Sequential Operations in Neural Networks
- [DAC'23] MixPipe: Efficient Bidirectional Pipeline Parallelism for Training Large-Scale Models
- [SC'23] Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency
- [SOSP'23] PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
- [SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
- [HPCA'23] Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism
- [ACL'23] Sequence Parallelism: Long Sequence Training from System Perspective
- [CCGrid'23] A Deep Learning Pipeline Parallel Optimization Method
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [ATC'23] Accelerating Distributed MoE Training and Inference with Lina
- [ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
- [ATC'23] MSRL: Distributed Reinforcement Learning with Dataflow Fragments
- [Survey π] [TPDS'23] A Survey on Auto-Parallelism of Large-Scale Deep Learning Training
- [ICML'23] SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
- [ICML'23] BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
- [ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- [NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- [NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- [NSDI'23] ARK: GPU-driven Code Execution for Distributed Deep Learning
- [SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
- [TPDS'23] Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
- [PPoPP'23] Elastic Averaging for Efficient Pipelined DNN Training
- [PPoPP'23] Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems
- [VLDB'23] MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
- [VLDB'23] Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
- [ASPLOS'23] Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers
- [ASPLOS'23] Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression
- [arxiv'22] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- [arxiv'22] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- [arxiv'22] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- [ICPP'22] Tesseract: Parallelize the Tensor Parallelism Efficiently
- [MLSys'22] Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning
- [NeurIPS'22] Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees
- [SoCC'22] Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
- [MLSys'22] Pathways: Asynchronous distributed dataflow for ML
- [MLSys'22] SRIFTY: Swift and Thrifty Distributed Neural Network Training on the Cloud
- [MLSys'22] Efficient Strong Scaling Through Burst Parallel Training
- [EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
- [ATC'22] Whale: Efficient Giant Model Training over Heterogeneous GPUs
- [NeurIPS'22] AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness
- [PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- [ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- [HPDC'22] Hare: Exploiting Inter-job and Intra-job Parallelism of Distributed Machine Learning on Heterogeneous GPUs
- [OSDI'22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- [NSDI'22] Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks
- [arxiv'21] Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
- [arxiv'21] GSPMD: General and Scalable Parallelization for ML Computation Graphs
- [JMLR'21] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- [TPDS'21] TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism
- [ATC'21] Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.9]
- [MLSys'21] PipeMare: Asynchronous Pipeline Parallel DNN Training
- [ICLR'21] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- [NeurIPS'21] Piper: Multidimensional Planner for DNN Parallelization
- [ICML'21] Memory-Efficient Pipeline-Parallel DNN Training
- [ICML'21] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
- [ICML'21] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- [SC'21] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- [SC'21] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (PTD-P or Megatron-LM v2)
- [FAST'21] Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
- [PPoPP'21] DAPPLE: a pipelined data parallel approach for training large models
- [VLDB'21] Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
- [HPCA'20] AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
- [NeurIPS'20] Efficient Algorithms for Device Placement of DNN Graph Operators
- [arxiv'20] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- [KDD'20 Tutorial] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
- [VLDB'20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
- [SOSP'19] PipeDream: Generalized Pipeline Parallelism for DNN Training
- [NeurIPS'20] Language Models are Few-Shot Learners [From OpenAI]
- [arxiv'20] Scaling Laws for Neural Language Models [From OpenAI]
- [HPCA'19] HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
- [IEEE MICRO'19] Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
- [MLSys'19] Beyond data and model parallelism for deep neural networks (FlexFlow)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [EuroSys'19] Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks
- [EuroSys'19] Supporting Very Large Models using Automatic Dataflow Graph Partitioning (Tofu)
- [SOSP'19] A Generic Communication Scheduler for Distributed DNN Training Acceleration
- [NeurIPS'19] Mesh-TensorFlow: Deep Learning for Supercomputers
- [NeurIPS'19] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- [ICML'18] Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
- [Survey π] [IJCAI'22] Survey on Efficient Training of Large Neural Networks
- [Survey π] [ACM CSUR'19] Demystifying Parallel and Distributed Deep Learning
- [Survey π] [ACM CSUR'19] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools

### 2.6 DL job failures

- [ATC'22] Sibylla: To Retry or Not To Retry on Deep Learning Job Failure
- [ICSE'20] An Empirical Study on Program Failures of Deep Learning Jobs

### 2.7 Model checkpointing

- [FAST'21] CheckFreq: Frequent, Fine-Grained DNN Checkpointing

### 2.8 AutoML

- [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
- [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
- [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework

### 2.9 Communication optimization

- [arxiv'24] MLTCP: Congestion Control for DNN Training
- [arxiv'24] Accelerating Distributed Deep Learning using Lossless Homomorphic Compression
- [NSDI'24] THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
- [arxiv'23] Optimized Network Architectures for Large Language Model Training with Billions of Parameters
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Zen: Near-Optimal Sparse Tensor Synchronization for Distributed DNN Training
- [arxiv'23] ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
- [arxiv'23] TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training
- [ICML'23] CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
- Related to DT-FM (NeurIPS'22)
- [IPDPS'23] MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
- [ASPLOS'23] MSCCLang: Microsoft Collective Communication Language
- [ASPLOS'23] Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
- [EuroSys'23] A2TP: Aggregator-aware In-network Aggregation for Multi-tenant Learning
- [EuroSys'23] Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies
- [MLSys'23] Cupcake: A Compression Optimizer for Scalable Communication-Efficient Distributed Training
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [NSDI'23] Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
- [NSDI'23] TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
- [ISCA'22] Themis: a network bandwidth-aware collective scheduling policy for distributed training of DL models
- [SC'22] HammingMesh: A Network Topology for Large-Scale Deep Learning
- [PPoPP'22] Near-optimal sparse allreduce for distributed deep learning
- [MLSys'22] Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning (P^2)
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.5]
- [SC'21] Flare: flexible in-network allreduce
- [NSDI'21] Scaling Distributed Machine Learning with In-Network Aggregation
- [ISCA'21] Enabling compute-communication overlap in distributed deep learning training platforms
- [PPoPP'21] Synthesizing optimal collective algorithms (SCCL)
- [PPoPP'20] Taming unbalanced training workloads in deep learning with partial collective operations
- [MLSys'20] Blink: Fast and Generic Collectives for Distributed ML
- [MLSys'20] PLink: Discovering and Exploiting Datacenter Network Locality for Efficient Cloud-based Distributed Training
- [MLSys'19] Priority-based Parameter Propagation for Distributed DNN Training (P3)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [SOSP'19] A generic communication scheduler for distributed DNN training acceleration (ByteScheduler)
- [ATC'17] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
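
Many of the collective-communication papers above are designed and evaluated against the standard ring all-reduce baseline, whose cost is usually written in the alpha-beta model (per-step latency alpha, per-byte time beta). A minimal sketch of that cost model; the numbers in the example are assumptions, not measurements from any of the listed systems:

```python
def ring_allreduce_time(n_bytes: float, p: int, alpha: float, beta: float) -> float:
    """Alpha-beta estimate for ring all-reduce over p ranks.

    Reduce-scatter plus all-gather take 2*(p-1) steps, each moving n_bytes/p,
    so per-rank traffic is 2*(p-1)/p * n_bytes (bandwidth-optimal for large messages).
    """
    steps = 2 * (p - 1)
    return steps * alpha + steps * (n_bytes / p) * beta

# Example: 1 GB of gradients, 64 ranks, 5 us per-step latency, 100 GB/s links
t = ring_allreduce_time(1e9, 64, alpha=5e-6, beta=1 / 100e9)
print(f"~{t * 1e3:.1f} ms per all-reduce")
```

Scheduling, synthesis, compression, and in-network aggregation approaches in this subsection can be read as attacking either the latency term (fewer or overlapped steps) or the bandwidth term (less per-rank traffic) of this baseline.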

### 2.10 Energy-efficient DNN training (carbon-aware)

- [arxiv'23] Perseus: Removing Energy Bloat from Large Model Training
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [ATC'23] EnvPipe: Performance-preserving DNN Training Framework for Saving Energy
- [NSDI'23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

### 2.11 DNN compiler

- [OSDI'23] Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning
- [OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
- [OSDI'23] Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
- [OSDI'23] EINNET: Optimizing Tensor Programs with Derivation-Based Transformations
- [OSDI'22] ROLLER: Fast and Efficient Tensor Compilation for Deep Learning
- [OSDI'20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
- [OSDI'20] Ansor: Generating High-Performance Tensor Programs for Deep Learning
- [ASPLOS'20] FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
- [OSDI'18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

### 2.12 Model pruning and compression

- [ACL'23] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
- [ICLR'23] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- [OSDI'23] AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models
- [ICML'22] TSPipe: Learn from Teacher Faster with Pipelines

### 2.13 GNN training system

For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.
- [VLDB'24] NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams
- [arxiv'23] ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training
- [arxiv'23] Helios: An Efficient Out-of-core GNN Training System on Terabyte-scale Graphs with In-memory Performance
- [arxiv'23] GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism
- [MLSys'23] Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
- [SIGMOD'23] DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [EuroSys'23] MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks
- [KDD'22] Distributed Hybrid CPU and GPU training for Graph Neural Networks on Billion-Scale Heterogeneous Graphs
- [VLDB'22] TGL: a general framework for temporal GNN training on billion-scale graphs
- [OSDI'21] P3: Distributed Deep Graph Learning at Scale

### 2.14 Congestion control for DNN training

- [arxiv'24] MLTCP: Congestion Control for DNN Training
- [HotNets'22] Congestion Control in Machine Learning Clusters

## 3. Inference System

- [arxiv'24] Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
- [arxiv'24] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
- [PPoPP'24 poster] POSTER: LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
- [NSDI'24] Approximate Caching for Efficiently Serving Diffusion Models
- [NSDI'24] Characterization of Large Language Model Development in the Datacenter
- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [arxiv'24] ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
- [arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
- [arxiv'24] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- [arxiv'24] Accelerating Retrieval-Augmented Language Model Serving with Speculation
- [arxiv'24] CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
- [arxiv'24] Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
- [arxiv'24] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- [arxiv'24] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- [Survey π] [arxiv'24] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
- [arxiv'24] Learned Best-Effort LLM Serving
- [arxiv'24] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- [VLDB'24] Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
- [ASPLOS'24] SpotServe: Serving Generative Large Language Models on Preemptible Instances
- [arxiv'23] Splitwise: Efficient generative LLM inference using phase splitting
- [arxiv'23] SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification
- [arxiv'23] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
- [arxiv'23] Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- [arxiv'23] Fairness in Serving Large Language Models
- [arxiv'23] HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment
- [arxiv'23] Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices
- [arxiv'23] Punica: Multi-Tenant LoRA Serving
- [arxiv'23] Pipeline Parallelism for DNN Inference with Practical Performance Guarantees
- [arxiv'23] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- [arxiv'23] High-throughput Generative Inference of Large Language Models with a Single GPU
- [HPDC'23] Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
- [SOSP'23] Paella: Low-latency Model Serving with Virtualized GPU Scheduling
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [MLSys'23] Efficiently Scaling Transformer Inference
- [EuroSys'23] Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access
- [EuroSys'23] Tabi: An Efficient Multi-Level Inference System for Large Language Models
- [EuroSys'23] Pocket: ML Serving from the Edge
- [OSDI'23] AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- [NSDI'23] SHEPHERD: Serving DNNs in the Wild
- [VLDB'23] Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
- [ICML'23] Fast Inference from Transformers via Speculative Decoding
- [SIGMOD'22] Serverless Data Science - Are We There Yet? A Case Study of Model Serving
- [OSDI'22] Orca: A Distributed Serving System for Transformer-Based Generative Models
- [OSDI'22] Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences
- [ATC'22] SOTER: Guarding Black-box Inference for General Neural Networks at the Edge
- [ATC'22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
- [ATC'22] Tetris: Memory-efficient Serverless Inference through Tensor Sharing
- [ATC'22] PetS: A Unified Framework for Parameter-Efficient Transformers Serving
- [ATC'21] INFaaS: Automated Model-less Inference Serving
- [SoCC'21] Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving
- [arxiv'21] Supporting Massive DLRM Inference through Software Defined Memory
- [MobiCom'20] SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud

## 4. Federated Learning

- [SAC'24] Training Heterogeneous Client Models using Knowledge Distillation in Serverless Federated Learning
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [arxiv'23] Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization
- [IMWUT'23] AttFL: A Personalized Federated Learning Framework for Time-series Mobile and Embedded Sensor Data Processing
- [Survey π] [FGCS'23] Model aggregation techniques in federated learning: A comprehensive survey
- [SoCC'23] Auxo: Heterogeneity-Mitigating Federated Learning via Scalable Client Clustering
- [MLSys'23] GlueFL: Reconciling Client Sampling and Model Masking for Bandwidth Efficient Federated Learning
- [WWW'23] To Store or Not? Online Data Selection for Federated Learning with Limited Storage
- [EuroSys'23] REFL: Resource-Efficient Federated Learning
- [VLDB'23] FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
- [RecSys'22] Towards Fair Federated Recommendation Learning: Characterizing the Inter-Dependence of System and Data Heterogeneity
- [TMLR'22] Optimal Client Sampling for Federated Learning
- [ICML'22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale
- [MobiSys'22] FedBalancer: data and pace control for efficient federated learning on heterogeneous clients
- [MobiCom'22] PyramidFL: A Fine-grained Client Selection Framework for Efficient Federated Learning
- [MLSys'22] PAPAYA: Practical, Private, and Scalable Federated Learning
- [AISTATS'22] Federated Learning with Buffered Asynchronous Aggregation
- [NeurIPS'21] Federated Reconstruction: Partially Local Federated Learning
- [NeurIPS'21] FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout
- [OSDI'21] Oort: Efficient Federated Learning via Guided Participant Selection
- [MICRO'21] AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning
- [MLSys'19] Towards Federated Learning at Scale: System Design
- [Survey π] [ACM CSUR'22] Federated Learning for Smart Healthcare: A Survey

## 5. Privacy-Preserving ML

- [DAC'23] Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators
- [ICLR'23] MPCFormer: fast, performant and private Transformer inference with MPC
- [NeurIPS'22] Iron: Private Inference on Transformers

## 6. ML APIs & Application-side Optimization

- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [OSDI'24 (to appear)] Automatic and Efficient Customization of Neural Networks for ML Applications
- [ICML'22] Efficient Online ML API Selection for Multi-Label Classification Tasks (FrugalMCT)
- [NeurIPS'20] FrugalML: How to use ML Prediction APIs more accurately and cheaply

## 7. ML for Systems

- [arxiv'24] Large Language Model Adaptation for Networking
- [arxiv'24] LLM-Enhanced Data Management
- [arxiv'24] MPIrigen: MPI Code Generation through Domain-Specific Language Models
- [arxiv'24] Can Large Language Models Write Parallel Code?
- [arxiv'23] LLM-Assisted Code Cleaning For Training Accurate Code Generators
- [arxiv'23] Large Language Models for Compiler Optimization
- [VLDB'23] How Large Language Models Will Disrupt Data Management

## Others

- [arxiv'24] You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- [arxiv'24] Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native
- [Survey π] [arxiv'24] A Survey of Resource-efficient LLM and Multimodal Foundation Models
- [arxiv'23] Efficiently Programming Large Language Models using SGLang
- [MICRO'23] Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads

## References

This repository is motivated by:
- https://github.com/HuaizhengZhang/Awesome-System-for-Machine-Learning
- https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
- https://github.com/ganler/ResearchReading
- https://jeongseob.github.io/readings_mlsys.html
- https://github.com/chwan1016/awesome-gnn-systems
- https://github.com/ConnollyLeon/awesome-Auto-Parallelism