CS692 Seminar: Systems for Machine Learning, Machine Learning for Systems

Course website: https://guanh01.github.io/teaching/2020-fall-mlsys

This is the (evolving) reading list for the seminar. The papers are from top ML venues (ICML, ICLR, etc) and system venues (ASPLOS, PLDI, etc). The selection criteria is whether some keywords are in paper title.

Topics of interest include, but are not limited to (copied from MLSys website):

Efficient model training, inference, and serving
Distributed and parallel learning algorithms
Privacy and security for ML applications
Testing, debugging, and monitoring of ML applications
Fairness, interpretability and explainability for ML applications
Data preparation, feature selection, and feature extraction
ML programming models and abstractions
Programming languages for machine learning
Visualization of data, models, and predictions
Specialized hardware for machine learning
Hardware-efficient ML methods
Machine Learning for Systems

Systems for Machine Learning

Distributed and Parallel Learning

[USENIX ATC'20]HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism
[ICLR'20]Decentralized Deep Learning with Arbitrary Communication Compression
[ASPLOS'20]Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training. Prague is a high-performance heterogeneity-aware asynchronous decentralized training approach that improves the performance for high heterogeneity systems. It has two contributions. First, it reduces synchronization cost via Partial All-Reduce that enables fast synchronization among a group of workers. Second, it reduces serialization cost via static group scheduling in homogeneous environment and simple techniques, i.e., Group Buffer and Group Division, to largely eliminate conflicts with slightly reduced randomness.
[EuroSys'20]Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning This paper talks about the scheduling of GPU clusters, which is not necessary related to deep learning.
[EuroSys'19]Supporting Very Large Models using Automatic Dataflow Graph Partitioning
[SOSP'19]A generic communication scheduler for distributed DNN training acceleration Partition and rearrange the tensor transmissions without changing the code of underlying framework, such as TensorFlow, PyTorch, and MXNet, by reducing the communication overhead.
[SOSP'19]PipeDream: generalized pipeline parallelism for DNN training Proposed a inter-batch pipelining to improve parallel training throughput.
[Survey'19] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools by Mayer, Ruben, and Hans-Arno Jacobsen. ACM Computing Surveys (CSUR)

Efficient Training

DNN Training

[USENIX ATC'20]Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training
[MLSys'20]MLPerf Training Benchmark
[ICLR'20(talk)]Reformer: The Efficient Transformer
[ICLR'20(spotlight)]Drawing Early-Bird Tickets: Toward More Efficient Training of Deep Networks
[ICLR'20]Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints
[ICML'20]DeltaGrad: Rapid retraining of machine learning models
[ASPLOS'20]Capuchin: Tensor-based GPU Memory Management for Deep Learning
[ASPLOS'19]Split-CNN: Splitting Window-based Operations in Convolutional Neural Networks for Memory System Optimization The paper addresses the issue of non-sufficient memory issue of GPUs for CNN training. It proposes two approaches. First, it splits a CNN network to multiple smaller ones. Second, it proposes to utilize nv-link to implement memory management offloading.

GNN Training

[ICLR'20(talk)]GraphZoom: A Multi-level Spectral Approach for Accurate and Scalable Graph Embedding
[MLSys'20]Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc The paper presents a distributed multi-GPU framework for fast GNN training and inference on graphs. ROC tackles two significant system challenges for distributed GNN computation: Graph partitioning and (2) memory management. Graph partitioning is based on an online-trained linear regression model. The memory management decides in which device memory to store each intermediate tensor to minimize data transfter cost. ROC introduces a dynamic programming algorithm to minimize data transfer cost.
[IPDPS'20]PCGCN: Partition-Centric Processing for Accelerating Graph Convolutional Network
[ArXiv'20]Reducing Communication in Graph Neural Network Training
[ICLR'20]GraphSAINT: Graph Sampling Based Inductive Learning Method
[NIPS'19]Layer-Dependent Importance Sampling for Training Deep and Large Graph Convolutional Networks
[KDD'19]Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks
[ATC'19]NeuGraph: Parallel Deep Neural Network Computation on Large Graphs NeuGraph is a new framework that bridges the graph and dataflow models to support efficient and scalable parallel neural network computation on graphs. Basically, it manages data partitioning, scheduling, and parallelism in dataflow-based deep learning frameworks in order to achieve better performance for GNN training.
[ICLR'19 Workshop]Deep graph library: Towards efficient and scalable deep learning on graphs
[VLDB'19]AliGraph: A Comprehensive Graph Neural Network Platform

Neural Architecture Search

Continous Learning

[ICLR'20]Continual learning with hypernetworks
[ICLR'20]Continual Learning with Adaptive Weights (CLAW)
[ICLR'20]Scalable and Order-robust Continual Learning with Additive Parameter Decomposition

Efficient Inference

Compiler

Resource Management

[VLDB'21]Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics
[MLSys'20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
[MLSys'20]Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference
[MobiSys'20]Fast and Scalable In-memory Deep Multitask Learning via Neural Weight Virtualization
[RTSS'19]Pipelined Data-Parallel CPU/GPU Scheduling for Multi-DNN Real-Time Inference
[EuroSys'19]GRNN: Low-Latency and Scalable RNN Inference on GPUs The paper improves the performance of RNN inference by providing a GPU-based RNN inference library, called GRNN, that provides low latency, high throughput, and efficient resource utilization.
[EuroSys'19]μLayer: Low Latency On-Device Inference Using Cooperative Single-Layer Acceleration and Processor-Friendly Quantization μLayer is a low latency on-device inference runtime that significantly improves the latency of NN-assisted services. μLayer accelerates each NN layer by simultaneously utilizing diverse heterogeneous processors on a mobile device and by performing computations using processor-friendly quantization. First, to accelerate an NN layer using both the CPU and the GPU at the same time, μLayer employs a layer distribution mechanism which completely removes redundant computations between the processors. Next, μLayer optimizes the per-processor performance by making the processors utilize different data types that maximize their utilization. In addition, to minimize potential latency increases due to overly aggressive workload distribution, μLayer selectively increases the distribution granularity to divergent layer paths.
[sysml@nips'18]Dynamic Space-Time Scheduling for GPU Inference
[MobiSys'17]DeepEye: Resource Efficient Local Execution of Multiple Deep Vision Models using Wearable Commodity Hardware

Compression

Pruning

[ICML'20]PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
[MLSys'20]What is the State of Neural Network Pruning? This paper provides an overview of approaches to pruning. The finding is that the community sufers from a lack of standardized benchmarks and metrics. It is hard to compare pruning techniques to one another or determine the progress the filed has made. The paper also introduce ShrinkBench, an open-source framework to faciliate standardized evaluations of pruning methods.
[ICML'20]PENNI: Pruned Kernel Sharing for Efficient CNN Inference
[ICML'20]Operation-Aware Soft Channel Pruning using Differentiable Masks
[ICML'20]DropNet: Reducing Neural Network Complexity via Iterative Pruning
[ICLR'20]A Signal Propagation Perspective for Pruning Neural Networks at Initialization
[ICML'20]Network Pruning by Greedy Subnetwork Selection
[ICLR'20][Talk]Comparing Rewinding and Fine-tuning in Neural Network Pruning
[ICLR'20]Lookahead: A Far-sighted Alternative of Magnitude-based Pruning
[ICLR'20]Provable Filter Pruning for Efficient Neural Networks
[ICLR'20]Dynamic Model Pruning with Feedback
[ICLR'20]One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation

Quantization

[ICML'20]Boosting Deep Neural Network Efficiency with Dual-Module Inference
[ICML'20]Towards Accurate Post-training Network Quantization via Bit-Split and Stitching
[ICML'20]Differentiable Product Quantization for Learning Compact Embedding Layers
[ICML'20]Online Learned Continual Compression with Adaptive Quantization Modules
[ICML'20]Multi-Precision Policy Enforced Training (MuPPET) : A Precision-Switching Strategy for Quantised Fixed-Point Training of CNNs
[ICML'20]Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks
[ICML'20]Up or Down? Adaptive Rounding for Post-Training Quantization
[ICLR'20]And the Bit Goes Down: Revisiting the Quantization of Neural Networks
[ICML'20]Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware
[ICLR'20]AutoQ: Automated Kernel-Wise Neural Network Quantization
[ICLR'20]Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks
[ICLR'20]Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks
[ICLR'20]Mixed Precision DNNs: All you need is a good parametrization
[MLSys'20]Riptide: Fast End-to-End Binarized Neural Networks This paper tries to speedup the implementation of binarized neural networks on Raspberry Pi 3B (RPi).
[MLSys'20]Memory-Driven Mixed Low Precision Quantization for Enabling Deep Network Inference on Microcontrollers The paper studies mixed low-bitwidth compression, featuring 8, 4 or 2-bit uniform quantization to enable integer-only operations on microcontrollers. It determines the minimum bit precision of every activation and weight tensor given the memory constraints of a device. The authors evaluate their approach using MobileNetV1 on a STM32H7 microcontroller.

Model Serving

[SOSP'19]Parity models: erasure-coded resilience for prediction serving systems Improving the performance of prediction serving system that could take in queries and return predictions by performing inference on models. This paper means to mitigate tail latency inflation.
[ATC'19]MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving The work focuses on the improvement of performance for ML-as-a-Service:: developers train ML models and publish them in the cloud as online services to provide low-latency inference at scale.
[OSDI'18]PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems: PRETZEL is a prediction serving system introducing anovel white box architecture enabling both end-to-endand multi-model optimizations. PRETZELis on average able to reduce 99th percentile la-tency by 5.5× while reducing memory footprint by 25×, and increasing throughput by 4.7×.

Testing and Debugging

[MLSys'20]Model Assertions for Monitoring and Improving ML Models The paper tries to monitor and improve ML models by using model assertions at all stages of ML system delopyment, including runtime monitoring and validating labels. For runtime monitoring, model assertions can find high confidence errors. For training, they propose a bandit-based active learning algorithm that can sample from data flagged by assertion to reduce labeling cost.

Robustness

[PLDI'20]Proving Data-Poisoning Robustness in Decision Trees This paper studies the robustness of decision tree models to the data poinsoning. It proposes a system that could still guarantee the correctness of model, even if the data set has been tempered.
[ASPLOS'20]DNNGuard: An Elastic Heterogeneous DNN Accelerator Architecture against Adversarial Attacks The paper proposes an elastic heterogeneous DNN accelerator architec-ture that can efficiently orchestrate the simultaneous execu-tion of original (target) DNN networks and thedetectalgo-rithm or network that detects adversary sample attacks.
[ICML'20]Proper Network Interpretability Helps Adversarial Robustness in Classification
[ICML'20]More Data Can Expand The Generalization Gap Between Adversarially Robust and Standard Models
[ICML'20]Dual-Path Distillation: A Unified Framework to Improve Black-Box Attacks
[ICML'20]Defense Through Diverse Directions
[ICML'20]Adversarial Robustness Against the Union of Multiple Threat Models
[ICML'20]Second-Order Provable Defenses against Adversarial Attacks
[ICML'20]Understanding and Mitigating the Tradeoff between Robustness and Accuracy
[ICML'20]Adversarial Robustness via Runtime Masking and Cleansing
[ICLR'20][Talk]Adversarial Training and Provable Defenses: Bridging the Gap
[ICLR'20]Defending Against Physically Realizable Attacks on Image Classification
[ICLR'20]Enhancing Adversarial Defense by k-Winners-Take-All
[ICLR'20]Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks
[ICLR'20]Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness
[ICLR'20]Robust Local Features for Improving the Generalization of Adversarial Training
[ICLR'20]GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification
[ICLR'20]Robust training with ensemble consensus
[ICLR'20]Fast is better than free: Revisiting adversarial training
[ICLR'20]EMPIR: Ensembles of Mixed Precision Deep Networks for Increased Robustness Against Adversarial Attacks
[ICLR'20]Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference

Other Metrics (Interpretability, Privacy, etc.)

[MLSys'20]Privacy-Preserving Bandits The paper tries to enable privacy in personalized recommendation. This paper proposes a technique Privacy-Preserving Bandits (P2B); a system that updates local agents by collecting feedback from other local agents in a differentially-private manner.
[ASPLOS'20]Shredder: Learning Noise Distributions to Protect Inference Privacy The work focuses on the cloud-based inference. It introduces the noise to the input data, but without sacrificing the accuracy of inference.
[ASPLOS'20]DeepSniffer: A DNN Model Extraction Framework Based on Learning Architectural Hints. 385-399 DeepSniffer extracts the model architecture information by learning the relation between extracted architectural hints (e.g., volumes of memory reads/writes obtained by side-channel or bus snooping attacks) and model internal architectures.

Data Preparation

[MLSys'20]Attention-based Learning for Missing Data Imputation The paper focuses on mixed (discrete and continuous) data and proposes AimNet, an attention-based learning network for missing data imputation in HoloClean, a state-of-the-art ML-based data cleaning framework. The paper argues that attention should be a central component in deep learning architectures for data imputation.
[ICLR'20]Can gradient clipping mitigate label noise?
[ICLR'20]SELF: Learning to Filter Noisy Labels with Self-Ensembling

ML programming models

[MLSys'20]Sense & Sensitivities: The Path to General-Purpose Algorithmic Differentiation present Zygote, an algorithmic differentiation (AD) system for the Julia language.

Machine Learning for Systems

ML for ml system

ML for compiler

[MLsys'20]AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning

ML for programming languages

[PLDI'20]Learning Nonlinear Loop Invariants with Gated Continuous Logic Networks The paper proposes a new neuro architecture (Gated Continuous Logic Network(G-CLN)) to learn nonlinear loop invariants. Utilizing DNN to solve and understand the system issue.
[PLDI'20]Blended, Precise Semantic Program Embeddings This paper is utilizing ML for systems. Basically, it utilizes DNN to learn program embeddings, vector representations of pro-gram semantics. Existing approaches predominately learn to embed programs from their source code, and, as a result, they do not capture deep, precise program semantics. On the other hand, models learned from runtime information critically depend on the quality of program executions, thus leading to trainedmodels with highly variant quality. LiGer learns programrepresentations from a mixture of symbolic and concrete exe-cution traces.
[ICLR'18]Learning to Represent Programs with Graphs

ML for memory management

[ICML'20]An Imitation Learning Approach for Cache Replacement
[(ASPLOS'20]Learning-based Memory Allocation for C++ Server Workloads

General Reports

[MLSys Whitepaper'18]: SysML: The New Frontier of Machine Learning Systems[must-read]
[NeurIPS'15]Hidden Technical Debt in Machine Learning Systems

Other Resources

About

This is the (evolving) reading list for the seminar.

guanh01 / CS692-mlsys