
💻Continuously updated: a curated collection of LLM inference papers and code resources, with PDFs; covers LLM.int8(), SmoothQuant, WINT8/4, Continuous Batching (dynamic request insertion), FP8, FlashAttention 1/2, PagedAttention, RoPE, and more.

Home Page: https://github.com/DefTruth/LLMs-Inference-Papers


LLMs-Inference-Papers

🌟Overview

Continuously updated: I recently wanted to read through the papers on LLM inference optimization as a whole, but found that the papers behind the ideas covered in various blog posts are rather scattered. So I compiled the LLM inference optimization papers I follow into a single volume, both for my own reading and reference and to share here. Format: PDF, with bookmarks, clickable navigation. For more papers, see the ✅LLM Inference Paper List below. Stars are welcome 🌟👨‍💻~

✅PDF Downloads

Download from the Releases page, or via the command line:

wget https://github.com/DefTruth/LLMs-Inference-Papers/releases/download/v0.1/LLMs-Inference-Papers-v0.1.zip
wget https://github.com/DefTruth/LLMs-Inference-Papers/releases/download/v0.2/LLMs-Inference-Papers-v0.2.zip
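
Then unpack the archive you downloaded (a minimal sketch; assumes unzip is installed, and the target folder name is only illustrative):

unzip LLMs-Inference-Papers-v0.2.zip -d LLMs-Inference-Papers-v0.2  # extract the bookmarked PDF into this folder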

🎉PDF Updates

  • LLMs-Inference-Papers-v0.1.pdf: an LLM primer with a focus on optimization; 600-page PDF. Covers Transformer, BN, LN, MQA, FlashAttention, FlashAttention-2, GLM, GLM-130B, GPT-3, GPT-3.5, GPT-4, LLaMA 1/2, LoRA, QLoRA, P-Tuning V1/V2, RoPE, SmoothQuant, WINT8/4, Continuous Batching (dynamic request insertion), FP8, and more.
  • LLMs-Inference-Papers-v0.2.pdf: LLM inference optimization papers (condensed edition, containing inference-optimization papers only); 286-page PDF. Includes ByteTransformer, FastServe, FlashAttention, FlashAttention-2, FlexGen, FP8, LLM.int8(), Tensor Core-related papers, PagedAttention, RoPE, SmoothQuant, SpecInfer, WINT8/4, Continuous Batching, ZeroQuant, and more.

📁LLM Inference Paper List

| Date | Title | Paper | Code |
|:---:|:---|:---:|:---:|
| 2022.10 | [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs | [arxiv][pdf] | [GitHub][ByteTransformer] |
| 2022.07 | [Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models | [osdi22-yu][pdf] | - |
| 2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models | [arxiv][pdf] | - |
| 2022.05 | [FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness | [arxiv][pdf] | [GitHub][flash-attention] |
| 2023.07 | [FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning | [arxiv][pdf] | [GitHub][flash-attention] |
| 2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU | [arxiv][pdf] | [GitHub][FlexGen] |
| 2022.09 | [FP8] FP8 Formats for Deep Learning | [arxiv][pdf] | - |
| 2022.08 | [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale | [arxiv][pdf] | [GitHub][bitsandbytes] |
| 2018.03 | [Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision | [arxiv][pdf] | - |
| 2018.05 | [Online Softmax] Online normalizer calculation for softmax | [arxiv][pdf] | - |
| 2023.09 | [PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention | [arxiv][pdf] | [GitHub][vllm] |
| 2023.08 | [Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library | [arxiv][pdf] | [GitHub][wmma_extension] |
| 2021.04 | [RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding | [arxiv][pdf] | [GitHub][transformers] |
| 2022.11 | [SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models | [arxiv][pdf] | [GitHub][smoothquant] |
| 2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | [arxiv][pdf] | [GitHub][FlexFlow] |
| 2022.11 | [WINT8/4] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production | [arxiv][pdf] | [GitHub][FasterTransformer] |
| 2022.06 | [ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | [arxiv][pdf] | [GitHub][DeepSpeed] |
| 2023.03 | [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | [arxiv][pdf] | [GitHub][DeepSpeed] |
| 2023.07 | [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | [arxiv][pdf] | [GitHub][DeepSpeed] |
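
For a taste of what these papers cover: the core idea of [Online Softmax] (which FlashAttention builds on) is to compute the softmax maximum and normalizer in a single pass over the scores, carrying two running values as each new element $x_j$ arrives (recurrence paraphrased here, not quoted verbatim from the paper):

$$
m_j = \max(m_{j-1},\, x_j), \qquad
d_j = d_{j-1}\, e^{m_{j-1} - m_j} + e^{x_j - m_j},
$$

so that after scanning all $V$ elements, $\mathrm{softmax}(x)_i = e^{x_i - m_V} / d_V$. Replacing three passes over memory with one is exactly the kind of IO-awareness that recurs throughout this list.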

©️License

GNU General Public License v3.0
