BBuf/run-rwkv-world-4-in-mlc-llm

Run RWKV-World-4 in MLC-LLM Tutorial

This tutorial explains how to deploy the RWKV-World-4 model within the mlc-llm framework, covering CUDA, Mac M2, and Android platforms.

The 1.5B RWKV-World-4 model exhibits a 'nan' phenomenon when deployed using FP16. We recommend deploying with the 3B or 7B model instead.

Preparation

RWKV-4-World model URL: RWKV-4-World-3B or RWKV-4-World-7B
Download the tokenizer from here and extract the tokenizer_model.zip to a file named tokenizer_model. This is the Tokenizer file for the RWKV-4-World series model.
Clone the repositories https://github.com/mlc-ai/mlc-llm and https://github.com/mlc-ai/relax. Make sure to use the --recursive parameter when cloning to include their third-party library dependencies.

You can also choose to use the unity branch of TVM instead of Relax, as the development progress of TVM's unity branch is ahead of Relax.

CUDA

Compile Relax

git clone --recursive git@github.com:mlc-ai/relax.git
cd relax
mkdir build
cd build
cp ../cmake/config.cmake ./

Next, modify the config.cmake file in the build directory. Since I'm compiling on the A800, I made the following changes:

set(USE_CUDA ON)
set(USE_CUTLASS ON)
set(USE_CUBLAS ON)
set(USE_LLVM ON)

You must choose USE_LLVM option, otherwise the mlc.build model will fail.

This enables CUDA and turns on two acceleration libraries, CUTLASS and CUBLAS. Then, in the build directory, execute cmake .. && make -j32.

Lastly, you might consider adding Relax to the PYTHONPATH environment variable to make it globally accessible. Add the following lines to ~/.bashrc:

export TVM_HOME=/bbuf/relax
export PYTHONPATH=$TVM_HOME/python:${PYTHONPATH}

Then, simply run source ~/.bashrc.

Compiling and Installing MLC-LLM

git clone --recursive git@github.com:mlc-ai/mlc-llm.git
cd mlc-llm/cmake
python3 gen_cmake_config.py

When executing python3 gen_cmake_config.py, you can choose the compilation options you need. For instance, in my case, I chose to enable CUDA, CUBLAS, and CUTLASS. Another thing to note is that the TVM_HOME path should be set to the path where Relax was compiled earlier.

Then, proceed with the following steps to compile:

cd ..
mkdir build
cp cmake/config.cmake build
cd build
cmake ..
make -j32

During the compilation, you'll also need to install Rust. Once the compilation is complete, the chat program mlc_chat_cli provided by mlc-llm will be installed. Additionally, for model conversion and quantization, you should also execute pip install . in the mlc-llm directory to install the mlc_llm package.

Model Conversion For model conversion, we primarily refer to this tutorial: https://mlc.ai/mlc-llm/docs/compilation/compile_models.html.

For instance, by executing python3 -m mlc_llm.build --hf-path StarRing2022/RWKV-4-World-7B --target cuda --quantization q4f16_1, we can quantize the weights of the RWKV-4-World-7B model to 4 bits, while the activation remains stored in FP16 format.

The target specifies the platform we intend to run on. This will compile the entire model graph into a dynamic link library (i.e., TVM's IRModule) for subsequent use by the mlc_chat_cli program (generated during the compilation of mlc-llm).

By default, a new dist/models directory will be created in the current directory to store the quantized model, configuration files, and the link library. After conversion and quantization, the model will be stored in the dist subdirectory of the current command directory (which will be created automatically). You can also manually clone the huggingface model into the dist/models directory. The structure of the quantized model is as follows:

The mlc-chat-config.json specifies some hyperparameters generated for the model, such as top_p, temperature, etc.

Lastly, before inference, we need to copy the tokenizer_model file, which we prepared initially, into this params directory.

Running Inference Execute the following command in the parent directory of mlc-llm:

./mlc-llm/build/mlc_chat_cli --model RWKV-4-World-7B-q0f16

You can replace RWKV-4-World-7B-q0f16 with the name you used during model quantization. After loading and running the system prompt, you can enjoy chatting with the RWKV-4-World model.

The program has some special commands to exit, check speed, etc.:

Performance Testing

Hardware	Quantization Method	Speed
A800	q0f16	prefill: 362.7 tok/s, decode: 72.4 tok/s
A800	q4f16_1	prefill: 1104.7 tok/s, decode: 122.6 tok/s

Here are two sets of performance data. If you're interested, you can test other configurations.

Layer-by-Layer Debugging Method

While adapting the 1.5B model, we encountered a phenomenon where the inference result was NaN. You can use the mlc-llm/tests/debug/dump_intermediate.py file to align the input and tokenizer results for debugging. This allows for precise simulation of model inference and printing of intermediate values for each layer, making it easy to identify at which layer the NaN occurred.

Deployment on Mac M2

Deploying on Mac and CUDA doesn't have significant differences. The main distinction is that when compiling relax and mlc-llm, you should now choose the Metal option instead of CUDA. I recommend handling the compilation issues within an Anaconda environment, rather than using the system's default Python environment.

When compiling relax, you need to enable both the Metal and LLVM options. If your system doesn't have LLVM, you can install it using Homebrew.

In mlc-llm, when generating the config.cmake, use the following options:

After compiling and running pip install ., use the command below to quantize the model:

python3 -m mlc_llm.build --hf-path StarRing2022/RWKV-4-World-3B --target metal --quantization q4f16_1

You also need to copy the tokenizer_model file to the params directory of the quantized model folder. Then, execute the following command to launch the chat program:

./mlc-llm/build/mlc_chat_cli --model RWKV-4-World-3B-q0f16

Lastly, here's a speed test for Mac M2:

Hardware	Quantization Method	Speed
Apple M2	q0f16	204.9 tok/s, decode: 12.1 tok/s
Apple M2	q4f16_1	prefill: 201.6 tok/s, decode: 26.3 tok/s

It's recommended to use the q4f16 configuration for faster responses.

Android

For Android, you can follow https://mlc.ai/mlc-llm/docs/deploy/android.html tutorial to compile apk in you phone.

Here are a few points that need to be modified according to the official tutorial:

Modify this file. The updated content should be:

{
  "model_libs": [
    "RWKV-4-World-3B-q4f16_1"
  ],
  "model_list": [
    {
      "model_url": "https://huggingface.co/BBuf/RWKV-4-World-3B-q4f16_1/",
      "local_id": "RWKV-4-World-3B-q4f16_1"
    }
  ],
  "add_model_samples": []
}

Modify this section of the code to:

compileOptions {
    sourceCompatibility JavaVersion.VERSION_17
    targetCompatibility JavaVersion.VERSION_17
}
kotlinOptions {
    jvmTarget = '17'
}

If you encounter the error: "Android Gradle plugin requires Java 17 to run. You are currently using Java 11", please follow the method provided in this document to clear the cache and recompile.

Once you have completed the APK compilation, you can enable developer mode on your phone and install the APK for use.

Using Xiaomi phones as an example, you can follow the tutorial below to enable developer mode and transfer the APK to your phone.

Step One: Enable USB Debugging on the Phone
- First, go to "Settings -> My Device -> All Specifications -> MIUI Version" on your phone and tap on "MIUI Version" seven times consecutively to enter developer mode.
- Next, navigate to "Settings -> Additional Settings -> Developer Options" and turn on both "USB Debugging" and "USB Installation".
Step Two: Configure Android Studio
- Open your Android Studio project, go to "Run -> Edit Configurations" as shown in the image below, and select "Open Select Deployment Target Dialog". This will prompt the device selection dialog every time you debug. Note: If you directly choose "USB Device", you might not be able to detect your phone during debugging.
Step Three: Online Debugging
- Connect your phone to the computer via USB. It will typically auto-install the necessary drivers. When you run the program, a device selection dialog will appear. Choose your phone, and the APK will automatically install and run.

An compiled apk: https://github.com/BBuf/run-rwkv-world-4-in-mlc-llm/releases/download/v1.0.0/app-debug.apk

BBuf / run-rwkv-world-4-in-mlc-llm

Run RWKV-World-4 in MLC-LLM Tutorial

Preparation

CUDA

Deployment on Mac M2

Android

Related Links

About