Research Papers I Read

  • This repository lists the research papers I've read, related to my research interests.

  • Click a [pdf] link to see the paper.

Big Data Framework

[1] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. [pdf]

[2] Dimopoulos, Stratos, Chandra Krintz, and Rich Wolski. "Big data framework interference in restricted private cloud settings." 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016. [pdf]

[3] Kwak, Jaewon, et al. "In-memory caching orchestration for Hadoop." 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 2016. [pdf]

[4] Hwang, Eunji, et al. "CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters." 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE, 2018. [pdf]

Task Management

[1] Petrucci, Vinicius, et al. "Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers." 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2015. [pdf]

System with AI

Machine & Deep Learning

[1] Hashemi, Milad, et al. "Learning Memory Access Patterns." arXiv preprint arXiv:1803.02329 (2018). [pdf]

Reinforcement Learning

[1] Mao, Hongzi, et al. "Resource management with deep reinforcement learning." Proceedings of the 15th ACM Workshop on Hot Topics in Networks. ACM, 2016. [pdf] [code]

[2] Ipek, Engin, et al. "Self-optimizing memory controllers: A reinforcement learning approach." ACM SIGARCH Computer Architecture News. Vol. 36. No. 3. IEEE Computer Society, 2008. [pdf]

[3] Baker, Bowen, et al. "Designing neural network architectures using reinforcement learning." arXiv preprint arXiv:1611.02167 (2016). [pdf]

[4] Nishtala, Rajiv, et al. "Hipster: Hybrid Task Manager for Latency-Critical Cloud Workloads." 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017. [pdf]

[5] Mirhoseini, Azalia, et al. "Device placement optimization with reinforcement learning." Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017. [pdf]

[6] Oh, Jisun, and Yoonhee Kim. "Job placement using reinforcement learning in GPU virtualization environment." Cluster Computing (2020): 1-16. [pdf]

System for Deep Learning

Basic Distributed Training System

[1] Chilimbi, Trishul, et al. "Project Adam: Building an efficient and scalable deep learning training system." 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014. [pdf]

[2] Keuper, Janis, and Franz-Josef Pfreundt. "Distributed training of deep neural networks: Theoretical and practical limits of parallel scalability." 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC). IEEE, 2016. [pdf]

ML Framework

[1] Abadi, Martín, et al. "TensorFlow: A system for large-scale machine learning." 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016. [pdf]

[2] Abadi, Martín, et al. "TensorFlow: Large-scale machine learning on heterogeneous distributed systems." arXiv preprint arXiv:1603.04467 (2016). [pdf]

Parallelism

[1] Huang, Yanping, et al. "GPipe: Efficient training of giant neural networks using pipeline parallelism." Advances in neural information processing systems. 2019. [pdf]

[2] Narayanan, Deepak, et al. "PipeDream: generalized pipeline parallelism for DNN training." Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 19). ACM, 2019. [pdf] [code]

[3] Park, Jay H., et al. "Accelerated Training for CNN Distributed Deep Learning through Automatic Resource-Aware Layer Placement." arXiv preprint arXiv:1901.05803 (2019). [pdf]

[4] Hegde, Vishakh, and Sheema Usmani. "Parallel and distributed deep learning." (2016). [pdf]

[5] Jia, Zhihao, Matei Zaharia, and Alex Aiken. "Beyond data and model parallelism for deep neural networks." Proceedings of the Conference on Systems and Machine Learning (SysML 2019). 2019. [pdf]

[6] Ono, Junya, Masao Utiyama, and Eiichiro Sumita. "Hybrid Data-Model Parallel Training for Sequence-to-Sequence Recurrent Neural Network Machine Translation." arXiv preprint arXiv:1909.00562 (2019). [pdf]

[7] Li, Youjie, et al. "Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training." Advances in Neural Information Processing Systems (NIPS). 2018. [pdf]

[8] Yi, Xiaodong, et al. "Optimizing distributed training deployment in heterogeneous GPU clusters." Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies (CoNEXT 20). 2020. [pdf] [code]

Microservice

[1] Crankshaw, Daniel, et al. "Clipper: A Low-Latency Online Prediction Serving System." 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 2017. [pdf]

[2] Moritz, Philipp, et al. "Ray: A Distributed Framework for Emerging AI Applications." 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018. [pdf]

GPU Memory Management

[1] Rhu, Minsoo, et al. "vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design." The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 2016. [pdf]

[2] Kim, Youngrang, et al. "Efficient Multi-GPU Memory Management for Deep Learning Acceleration." 2018 IEEE 3rd International Workshops on Foundations and Applications of Self* Systems (FAS*W). IEEE, 2018. [pdf]

[3] Meng, Chen, et al. "Training deeper models by GPU memory optimization on TensorFlow." Proc. of ML Systems Workshop in NIPS. 2017. [pdf]

[4] Li, Chen, et al. "A Framework for Memory Oversubscription Management in Graphics Processing Units." Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2019. [pdf]

[5] Jain, Animesh, et al. "Gist: Efficient data encoding for deep neural network training." 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018. [pdf]

[6] Yu, Peifeng, and Mosharaf Chowdhury. "Salus: Fine-grained gpu sharing primitives for deep learning applications." Proceedings of Machine Learning and Systems 2020 (MLSys 2020). [pdf] [code]

[7] Narayanan, Deepak, et al. "Memory-efficient pipeline-parallel dnn training." arXiv preprint arXiv:2006.09503 (2020). [pdf]

Parameter Server

[1] Li, Mu, et al. "Scaling distributed machine learning with the parameter server." 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014. [pdf]

[2] Jiang, Jiawei, et al. "Heterogeneity-aware distributed parameter servers." Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 2017. [pdf]

[3] Cui, Henggang, et al. "GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server." Proceedings of the Eleventh European Conference on Computer Systems (EuroSys). ACM, 2016. [pdf] [code]

Communication

[1] Zhang, Hao, et al. "Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters." 2017 USENIX Annual Technical Conference (USENIX ATC 17). 2017. [pdf]

[2] Sergeev, Alexander, and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow." arXiv preprint arXiv:1802.05799 (2018). [pdf] [code]

[3] Kim, Soojeong, et al. "Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks." Proceedings of the Fourteenth EuroSys Conference 2019. ACM, 2019. [pdf] [code]

[4] Xue, Jilong, et al. "Fast Distributed Deep Learning over RDMA." Proceedings of the Fourteenth EuroSys Conference 2019. ACM, 2019. [pdf]

[5] Peng, Yanghua, et al. "A generic communication scheduler for distributed DNN training acceleration." Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 19). ACM, 2019. [pdf]

Decentralized Training

[1] Lian, Xiangru, et al. "Asynchronous decentralized parallel stochastic gradient descent." arXiv preprint arXiv:1710.06952 (2017). [pdf]

[2] Lian, Xiangru, et al. "Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent." Advances in Neural Information Processing Systems (NIPS). 2017. [pdf]

[3] Luo, Qinyi, et al. "Hop: Heterogeneity-aware decentralized training." Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 2019. [pdf]

[4] Kadav, Asim, and Erik Kruus. "ASAP: asynchronous approximate data-parallel computation." arXiv preprint arXiv:1612.08608 (2016). [pdf]

[5] Luo, Qinyi, et al. "Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training." Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 2020. [pdf]

Cluster Scheduling

[1] Bao, Yixin, et al. "Online Job Scheduling in Distributed Machine Learning Clusters." IEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, 2018. [pdf]

[2] Xiao, Wencong, et al. "Gandiva: Introspective cluster scheduling for deep learning." 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018. [pdf]

[3] Chaudhary, Shubham, et al. "Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning." Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys 20). 2020. [pdf]

[4] Peng, Yanghua, et al. "Optimus: an efficient dynamic resource scheduler for deep learning clusters." Proceedings of the 13th EuroSys Conference. 2018. [pdf]

[5] Gu, Juncheng, et al. "Tiresias: A GPU cluster manager for distributed deep learning." 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 2019. [pdf] [code]

[6] Le, Tan N., et al. "AlloX: compute allocation in hybrid clusters." Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys 20). 2020. [pdf]

[7] Han, Jingoo, et al. "MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems." 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID). IEEE, 2020. [pdf]

[8] Xiao, Wencong, et al. "AntMan: Dynamic Scaling on GPU Clusters for Deep Learning." 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020. [pdf] [code]

[9] Narayanan, Deepak, et al. "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads." 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020. [pdf] [code]

[10] Zhao, Hanyu, et al. "HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees." 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020. [pdf] [code]

Synchronization (Convergence)

[1] Zhang, Chengliang, et al. "Stay Fresh: Speculative Synchronization for Fast Distributed Machine Learning." 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2018. [pdf]

[2] Dean, Jeffrey, et al. "Large Scale Distributed Deep Networks." Advances in neural information processing systems (NIPS). 2012. [pdf]

[3] Goyal, Priya, et al. "Accurate, large minibatch SGD: Training ImageNet in 1 hour." arXiv preprint arXiv:1706.02677 (2017). [pdf]

[4] Bottou, Léon, and Olivier Bousquet. "The tradeoffs of large scale learning." Advances in neural information processing systems (NIPS). 2008. [pdf]

[5] Chen, Jianmin, et al. "Revisiting distributed synchronous SGD." arXiv preprint arXiv:1604.00981 (2016). [pdf]

[6] Oh, Hyungjun, et al. "Convergence-aware neural network training." 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE, 2020. [pdf]

Performance Metric (Benchmark)

[1] Coleman, Cody, et al. "DAWNBench: An end-to-end deep learning benchmark and competition." Advances in neural information processing systems (NIPS). 2017. [pdf]

[2] Mattson, Peter, et al. "MLPerf training benchmark." arXiv preprint arXiv:1910.01500 (2019). [pdf]

Data Partitioning

[1] Wei, Kai, et al. "How to intelligently distribute training data to multiple compute nodes: Distributed machine learning via submodular partitioning." Neural Information Processing Systems (NIPS) Workshop, Montreal, Canada. 2015. [pdf]

Code Optimization

[1] Chen, Tianqi, et al. "TVM: An automated end-to-end optimizing compiler for deep learning." 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018. [pdf]

Model Testing

[1] Pei, Kexin, et al. "Deepxplore: Automated whitebox testing of deep learning systems." Proceedings of the 26th Symposium on Operating Systems Principles (SOSP 17). ACM, 2017. [pdf]

[2] Lee, Yunseong, et al. "PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems." 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018. [pdf]

Serving

[1] Zhang, Minjia, et al. "DeepCPU: Serving RNN-based deep learning models 10x faster." 2018 USENIX Annual Technical Conference (USENIX ATC 18). 2018. [pdf]

Federated Learning

[1] Bonawitz, Keith, et al. "Towards federated learning at scale: System design." arXiv preprint arXiv:1902.01046 (2019). [pdf]

[2] McMahan, H. Brendan, et al. "Communication-efficient learning of deep networks from decentralized data." arXiv preprint arXiv:1602.05629 (2016). [pdf]

Compression (Pruning, Quantization, Precision)

[1] Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint arXiv:1510.00149 (2015). [pdf]

[2] Alistarh, Dan, et al. "QSGD: Communication-efficient SGD via gradient quantization and encoding." Advances in Neural Information Processing Systems. 2017. [pdf]

[3] Zhou, Aojun, et al. "Incremental network quantization: Towards lossless cnns with low-precision weights." arXiv preprint arXiv:1702.03044 (2017). [pdf]

[4] Wen, Wei, et al. "Learning structured sparsity in deep neural networks." Advances in neural information processing systems. 2016. [pdf]

[5] Micikevicius, Paulius, et al. "Mixed precision training." arXiv preprint arXiv:1710.03740 (2017). [pdf]

[6] Luo, Jian-Hao, and Jianxin Wu. "Autopruner: An end-to-end trainable filter pruning method for efficient deep model inference." arXiv preprint arXiv:1805.08941 (2018). [pdf]

Multi-task Learning

[1] Liu, Sulin, Sinno Jialin Pan, and Qirong Ho. "Distributed multi-task relationship learning." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017. [pdf]

Hyper-parameter Optimization

[1] Shin, Ahnjae, et al. "Stage-based hyper-parameter optimization for deep learning." Systems for ML Workshop at NeurIPS 2019. [pdf]

[2] Smith, Samuel L., et al. "Don't decay the learning rate, increase the batch size." ICLR. 2018. [pdf]

[3] Li, Liam, et al. "A System for Massively Parallel Hyperparameter Tuning." Proceedings of Machine Learning and Systems 2020 (MLSys 2020). [pdf]

[4] Falkner, Stefan, Aaron Klein, and Frank Hutter. "BOHB: Robust and efficient hyperparameter optimization at scale." Proceedings of ICML 2018. [pdf]

[5] Shin, Ahnjae, et al. "Hippo: Taming Hyper-parameter Optimization of Deep Learning with Stage Trees." arXiv preprint arXiv:2006.11972 (2020). [pdf]

[6] Liaw, Richard, et al. "Tune: A research platform for distributed model selection and training." ICML AutoML workshop 2018. [pdf]

[7] Golovin, Daniel, et al. "Google vizier: A service for black-box optimization." Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 2017. [pdf]

[8] Stich, Sebastian, Amirkeivan Mohtashami, and Martin Jaggi. "Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates." International Conference on Artificial Intelligence and Statistics. PMLR, 2021. [pdf]

[9] Rocha, Isabelly, et al. "PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep Learning Clusters." Proceedings of the 21st International Middleware Conference. 2020. [pdf]

Ensemble Training

[1] Pittman, Randall, et al. "Exploring flexible communications for streamlining DNN ensemble training pipelines." SC 18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018. [pdf]

[2] Guan, Hui, et al. "FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks." Proceedings of Machine Learning and Systems 2020 (MLSys 2020). [pdf]

Adaptive Training (Resource & Hyper-parameter)

[1] Chen, Chen, et al. "Fast distributed deep learning via worker-adaptive batch sizing." Proceedings of the ACM Symposium on Cloud Computing (SoCC). 2018. [pdf]

[2] Or, Andrew, Haoyu Zhang, and Michael J. Freedman. "Resource Elasticity in Distributed Deep Learning." Proceedings of Machine Learning and Systems 2020 (MLSys 2020). [pdf]

[3] Mai, Luo, et al. "KungFu: Making Training in Distributed Machine Learning Adaptive." 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020. [pdf] [code]

[4] Lin, Haibin, et al. "Dynamic mini-batch sgd for elastic distributed training: Learning in the limbo of resources." arXiv preprint arXiv:1904.12043 (2019). [pdf]

[5] Johnson, Tyler, et al. "AdaScale SGD: A user-friendly algorithm for distributed training." International Conference on Machine Learning. PMLR, 2020. [pdf]

Storage & NVM

[1] Eisenman, Assaf, et al. "Bandana: Using non-volatile memory for storing deep learning models." Proceedings of Machine Learning and Systems 2019 (MLSys 2019). [pdf]

[2] Kumar, Abhishek Vijaya, and Muthian Sivathanu. "Quiver: An informed storage cache for Deep Learning." 18th USENIX Conference on File and Storage Technologies (FAST 20). 2020. [pdf]

Input Pipeline

[1] Zhu, Yue, et al. "Entropy-aware I/O pipelining for large-scale deep learning on HPC systems." 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 2018. [pdf]

[2] Mohan, Jayashree, et al. "Analyzing and Mitigating Data Stalls in DNN Training." Proceedings of the VLDB Endowment, 2021. [pdf] [code]

System for Reinforcement Learning

Parallel Method

[1] Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning (ICML). 2016. [pdf]

[2] Nair, Arun, et al. "Massively parallel methods for deep reinforcement learning." arXiv preprint arXiv:1507.04296 (2015). [pdf] [code]
