
💻Continuously updated: a curated collection of LLM inference papers and code resources, with PDFs; covers LLM.int8(), SmoothQuant, WINT8/4, Continuous Batching (dynamic request insertion), FP8, FlashAttention 1/2, PagedAttention, RoPE, and more.

Home Page: https://github.com/DefTruth/LLMs-Inference-Papers


LLMs-Inference-Papers

🌟Overview

Continuously updated: I recently wanted to read through the papers on LLM inference optimization as a whole, but found that the papers behind the ideas covered in various blog posts are rather scattered. So I compiled the LLM inference optimization papers I follow into a single volume, both for my own reading and reference and to share here. Format: PDF, with bookmarks, clickable navigation. For more papers, see the ✅LLM Inference Paper List below. Stars are welcome 🌟👨‍💻~

✅PDF Downloads

Download from the Releases page, or via the command line:

wget https://github.com/DefTruth/LLMs-Inference-Papers/releases/download/v0.1/LLMs-Inference-Papers-v0.1.zip
wget https://github.com/DefTruth/LLMs-Inference-Papers/releases/download/v0.2/LLMs-Inference-Papers-v0.2.zip
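
Then unpack the archive you downloaded (a minimal sketch; assumes unzip is installed, and the target folder name is only illustrative):

unzip LLMs-Inference-Papers-v0.2.zip -d LLMs-Inference-Papers-v0.2  # extract the bookmarked PDF into this folder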

🎉PDF Updates

  • LLMs-Inference-Papers-v0.1.pdf: an LLM primer with a focus on optimization; 600-page PDF. Covers Transformer, BN, LN, MQA, FlashAttention, FlashAttention-2, GLM, GLM-130B, GPT-3, GPT-3.5, GPT-4, LLaMA 1/2, LoRA, QLoRA, P-Tuning V1/V2, RoPE, SmoothQuant, WINT8/4, Continuous Batching (dynamic request insertion), FP8, and more.
  • LLMs-Inference-Papers-v0.2.pdf: LLM inference optimization papers (condensed edition, containing inference-optimization papers only); 286-page PDF. Includes ByteTransformer, FastServe, FlashAttention, FlashAttention-2, FlexGen, FP8, LLM.int8(), Tensor Core-related papers, PagedAttention, RoPE, SmoothQuant, SpecInfer, WINT8/4, Continuous Batching, ZeroQuant, and more.

📁LLM Inference Paper List

| Date | Title | Paper | Code |
|:---:|:---|:---:|:---:|
| 2022.10 | [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs | [arxiv][pdf] | [GitHub][ByteTransformer] |
| 2022.07 | [Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models | [osdi22-yu][pdf] | - |
| 2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models | [arxiv][pdf] | - |
| 2022.05 | [FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness | [arxiv][pdf] | [GitHub][flash-attention] |
| 2023.07 | [FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning | [arxiv][pdf] | [GitHub][flash-attention] |
| 2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU | [arxiv][pdf] | [GitHub][FlexGen] |
| 2022.09 | [FP8] FP8 Formats for Deep Learning | [arxiv][pdf] | - |
| 2022.08 | [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale | [arxiv][pdf] | [GitHub][bitsandbytes] |
| 2018.03 | [Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision | [arxiv][pdf] | - |
| 2018.05 | [Online Softmax] Online normalizer calculation for softmax | [arxiv][pdf] | - |
| 2023.09 | [PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention | [arxiv][pdf] | [GitHub][vllm] |
| 2023.08 | [Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library | [arxiv][pdf] | [GitHub][wmma_extension] |
| 2021.04 | [RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding | [arxiv][pdf] | [GitHub][transformers] |
| 2022.11 | [SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models | [arxiv][pdf] | [GitHub][smoothquant] |
| 2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | [arxiv][pdf] | [GitHub][FlexFlow] |
| 2022.11 | [WINT8/4] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production | [arxiv][pdf] | [GitHub][FasterTransformer] |
| 2022.06 | [ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | [arxiv][pdf] | [GitHub][DeepSpeed] |
| 2023.03 | [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | [arxiv][pdf] | [GitHub][DeepSpeed] |
| 2023.07 | [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | [arxiv][pdf] | [GitHub][DeepSpeed] |
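
For a taste of what these papers cover: the core idea of [Online Softmax] (which FlashAttention builds on) is to compute the softmax maximum and normalizer in a single pass over the scores, carrying two running values as each new element $x_j$ arrives (recurrence paraphrased here, not quoted verbatim from the paper):

$$
m_j = \max(m_{j-1},\, x_j), \qquad
d_j = d_{j-1}\, e^{m_{j-1} - m_j} + e^{x_j - m_j},
$$

so that after scanning all $V$ elements, $\mathrm{softmax}(x)_i = e^{x_i - m_V} / d_V$. Replacing three passes over memory with one is exactly the kind of IO-awareness that recurs throughout this list.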

©️License

GNU General Public License v3.0
