LLMServe / SwiftTransformer

High performance Transformer implementation in C++.

SwiftTransformer

SwiftTransformer is a tiny yet powerful implementation of the inference infrastructure for transformer model families. It aims to provide an easy-to-use framework for researchers to try out their ideas and iterate quickly. It also supports popular features like model/pipeline parallelism, FlashAttention, Continuous Batching, and PagedAttention, and should work as a solid foundation for researchers to build their prototypes. Currently, DistServe and FastServe use SwiftTransformer as the execution backend.

It has the following advantages:

  • Tiny. It only contains the essential code to run LLM inference, so you can get your hands on it and experiment with your research ideas without much effort. In fact, this project was launched after the author tried to implement a research prototype on top of FasterTransformer.
  • Efficient. It is written in C++ and adopts custom CUDA kernels from xformers for performance. It also supports features like model/pipeline parallelism, FlashAttention, Continuous Batching and PagedAttention.
  • Easy-to-use. It provides PyTorch bindings for integration with Python, so you can easily build your own prototype in Python on top of it.
  • Well-documented. It has detailed documentation for researchers to hack around easily.

Build

NOTE: If you want to run LLM inference off-the-shelf, please refer to the high-level LLM serving systems written in Python on top of SwiftTransformer (such as DistServe and FastServe). Both contain detailed documentation on environment setup.

If you want to build your own project on top of SwiftTransformer, follow these steps:

# setup and activate the conda environment
conda env create -f environment.yml && conda activate SwiftTransformer

# build SwiftTransformer
cmake -B build && cmake --build build -j$(nproc)

If everything works fine, you should see libst_pybinding.so under the SwiftTransformer/build/lib directory. You can load this dynamic library in your Python project.
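
For example, loading the library from Python might look like the sketch below. This assumes the binding registers its operators and classes with PyTorch; the exact functions and classes it exports are defined in src/csrc/pybinding.cc and are not listed here.

    # Minimal sketch: load the SwiftTransformer binding into a Python process.
    # Assumption: the .so registers its operators/classes with PyTorch, so
    # torch.ops.load_library() is sufficient; see src/csrc/pybinding.cc for the
    # names that are actually exported.
    import pathlib
    import torch

    lib_path = pathlib.Path("SwiftTransformer/build/lib/libst_pybinding.so").resolve()
    torch.ops.load_library(str(lib_path))
    # The exported operators/classes are now reachable via torch.ops / torch.classes.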

Run

We provide a simple example to run the OPT-1.3B model. Again, if you want to run LLM inference off-the-shelf, please see DistServe and FastServe.

  • Download the tokenizer.

    mkdir models
    wget https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-merges.txt -O models/gpt2-merges.txt
    wget https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-vocab.json -O models/gpt2-vocab.json
  • Download the OPT-1.3B model weights.

    wget https://dl.fbaipublicfiles.com/opt/v1_20230405/1.3b/reshard-model_part-0.pt -O models/opt-1.3b.pt

    Note: Please do not choose OPT-350M, since its architecture differs from that of the other OPT models.

  • Convert the weight format. The weight file is stored in .pt format (generated by torch.save()), which cannot be loaded by LibTorch, so we need to convert it first (a conceptual sketch of this conversion follows the step list below).

    Use python3 scripts/converter.py --input <path/to/your/downloaded/model> --output <path/to/converted/weights> --dtype <data type (fp16 or fp32)> --model <model name (opt or llama2)>

    python3 scripts/converter.py --input models/opt-1.3b.pt --output models/opt-1.3b-conv-fp16.pt --dtype fp16 --model opt
  • Prepare your input. Use python3 scripts/encode_input.py <path/to/vocab.json> <path/to/merges.txt> to encode your input. This script accepts your requests from stdin (one per line) and writes the encoded input to stdout.

    mkdir inputs
    printf "Life blooms like a flower. Far away or by the road. Waiting for the one, to\nA quick brown fox\nArtificial intelligence is\nTo be or not to be," > inputs/input1_plain.txt
    python3 scripts/encode_input.py models/gpt2-vocab.json models/gpt2-merges.txt < inputs/input1_plain.txt > inputs/input1_encoded.txt
  • Run the model.

    build/bin/run_opt models/opt-1.3b-conv-fp16.pt 1.3b models/gpt2-vocab.json fp16 inputs/input1_encoded.txt
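
For the curious, the conversion step boils down to reading the torch.save() checkpoint in Python and re-writing the tensors in a form the C++ side can load. The sketch below illustrates the general idea only; the authoritative implementation is scripts/converter.py, whose on-disk layout and model-specific handling (e.g. sharded weights) may differ.

    # Hypothetical sketch of the conversion idea; scripts/converter.py is the
    # real implementation and its output format may differ.
    import torch

    ckpt = torch.load("models/opt-1.3b.pt", map_location="cpu")
    state_dict = ckpt.get("model", ckpt)  # metaseq checkpoints may nest weights under "model"
    tensors = {k: v.to(torch.float16) for k, v in state_dict.items() if torch.is_tensor(v)}

    # One way to make tensors readable from C++: store them as buffers of a
    # TorchScript module, which LibTorch can open with torch::jit::load().
    class Container(torch.nn.Module):
        def __init__(self, tensors):
            super().__init__()
            for name, t in tensors.items():
                self.register_buffer(name.replace(".", "__"), t)

        def forward(self) -> int:  # dummy forward; only the buffers matter
            return 0

    torch.jit.save(torch.jit.script(Container(tensors)), "models/opt-1.3b-conv-fp16.pt")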

Testing

We provide various unit tests to verify the correctness of the model's components. To run them, compile the project and then execute bin/unittest_XXX in the build directory.

Development

Code Structure

Currently, the code is organized as follows:

src
├── csrc
│   ├── kernel
│   ├── layer
│   ├── model
│   ├── pybinding.cc
│   └── util
├── examples
│   ├── benchmark_all_input_same.cc
│   ├── CMakeLists.txt
│   ├── lib
│   └── run_gpt.cc
└── unittest
    ├── kernel
    ├── layer
    ├── model
    ├── unittest_torch_utils.h
    ├── unittest_utils.h
    └── util

The csrc folder contains the core implementation of the model, including every kernel, layer and model.

The unittest folder contains unit tests for the components in csrc. The kernel, layer, model, and util folders under unittest contain the tests for the corresponding components in csrc. For example, src/unittest/layer/attention.cc contains the unit test for the Attention layer, which is implemented in src/csrc/layer/attention.cc.

Note for VS Code users: if you encounter "#include errors detected. Please update your includePath.", you may need to update the include path in .vscode/c_cpp_properties.json.

Design Philosophy

  • Well-documented. We strongly believe that a well-documented codebase boosts the efficiency of research, so we try our best to document every function and class. Typically we explain the purpose of a function and the meaning of its arguments right before its implementation in the .cc file.
  • POP-styled design. Unlike FasterTransformer, which adopts an object-oriented programming (OOP) design, we adopt a more procedure-oriented programming (POP) style. We believe POP is more suitable for research projects, since it makes the code easier to extend and modify. Ask why we need OOP, and the answer is usually "to hide the details"; in research projects, however, we need to know, and alter, those details. Therefore all kernels and layers are implemented in POP style.
  • Extensive unit tests. Every kernel and layer is paired with a unit test. We believe that unit tests are essential for research projects, since they can help us to verify the correctness of our implementation. We use googletest as our unit test framework. With the help of TYPED_TEST from googletest, we can test our kernels and layers with different data types (e.g. float and half) without writing redundant code.
  • LibTorch for reference in unit tests. For the "reference" part in unittests, we use LibTorch to implement the same kernel or layer. This is because LibTorch is well-tested, and we can use it as a reference to verify the correctness of our implementation.
  • Raw pointers instead of at::Tensor. We prefer raw C pointers over at::Tensor (the tensor class provided by LibTorch, the C++ frontend of PyTorch), since we need fine-grained control over the memory layout.
