RWKV-Infer

A large-scale RWKV v6 inference engine using the CUDA backend. Supports multi-batch generation and dynamic State switching.

This project aims to simplify the deployment of RWKV model inference in a Docker container.

The following features are included:

  • Support for multi-batch generation and stream delivery
  • State switching for each batch
  • OpenAI-compatible API
  • Dynamic RNN State Cache (2024-06-10): by dynamically caching RNN states, the server avoids redundant state regeneration and speeds up inference (a conceptual sketch follows below)
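
The repository does not document the cache internals here, so the following is only a conceptual sketch of how a dynamic RNN state cache could work: an LRU map from a prompt prefix to a saved state, sized like --dynamic_state_cache_size. Class and method names are hypothetical, not the actual RWKV-Infer implementation.

# Conceptual sketch only; not the actual RWKV-Infer implementation.
# Assumes an LRU eviction policy; names and key format are hypothetical.
from collections import OrderedDict

class DynamicStateCache:
    """Maps a prompt prefix (tuple of token ids) to a saved RNN state."""

    def __init__(self, max_entries=64):  # e.g. --dynamic_state_cache_size 64
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def get(self, prefix_tokens):
        state = self._cache.get(prefix_tokens)
        if state is not None:
            self._cache.move_to_end(prefix_tokens)  # mark as recently used
        return state  # None -> the prefix must be processed from scratch

    def put(self, prefix_tokens, state):
        self._cache[prefix_tokens] = state
        self._cache.move_to_end(prefix_tokens)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the least recently used state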

How To Use

    1. Install the latest PyTorch with CUDA support (2.2+ tested)
    2. Install the requirements
pip install -r requirements.txt
    3. Place model files in the models folder
    4. Place State files in the states folder
    5. Run the server
python rwkv_server.py --localhost 0.0.0.0 --port 8000 --debug False --workers 16 --dynamic_state_cache_size 64
    6. Load a model
curl http://127.0.0.1:8000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth","model_viewname":"RWKV x060 1B6 Base","model_strategy":"cuda fp16"}'
    7. Enjoy inference via the OpenAI-compatible API! (A request sketch follows below.)
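
As a quick check, a chat completion request can be sent with any OpenAI-style client. The sketch below uses Python's requests library and assumes the standard /v1/chat/completions route and the model_viewname set at load time; adjust the URL, route, and name to match your deployment.

# Minimal sketch of an OpenAI-style chat request; the /v1/chat/completions
# route and payload fields are assumptions, adjust them to your deployment.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "RWKV x060 1B6 Base",  # the model_viewname used at load time
        "messages": [{"role": "user", "content": "Hello, RWKV!"}],
        "max_tokens": 128,
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])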

API Examples

    1. Load a model
curl http://127.0.0.1:8000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth","model_viewname":"RWKV x060 1B6 Base","model_strategy":"cuda fp16"}'
    2. Add a State
curl http://127.0.0.1:8000/loadstatemodel -X POST -H "Content-Type: application/json" -d '{"state_filename":"state.pth","state_viewname":"State Test"}'
    3. Remove all States
curl http://127.0.0.1:8000/removestatemodel -X POST -H "Content-Type: application/json" -d '{"dummy":"dummy"}'
    4. Get model names (during inference, setting the model field to one of these IDs enables dynamic State loading; see the sketch after this list)
curl http://127.0.0.1:8000/models -X GET
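
To illustrate dynamic State loading, the sketch below first lists the served names from /models and then passes one of them as the model field of a chat request. The OpenAI-style listing shape ({"data": [{"id": ...}]}) and the /v1/chat/completions route are assumptions; check your server's actual responses.

# Sketch: list served model IDs, then use one as "model" so the matching
# State is loaded dynamically. Response shape and routes are assumptions.
import requests

BASE = "http://127.0.0.1:8000"

models = requests.get(f"{BASE}/models", timeout=30).json()
names = [m["id"] for m in models.get("data", [])]  # assumes OpenAI-style listing
print(names)  # pick the name that corresponds to the State you want

resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": names[0],  # a listed ID that corresponds to a loaded State
        "messages": [{"role": "user", "content": "Hi"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])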

ToDo for me

  • (Done) Dynamic State Cache for faster inference
  • Dynamic LoRA swapping (torch.compile...)
  • RAG (Cold RAG)
  • Research 4-bit inference with 4-bit matmul

About

A large-scale RWKV v6 inference engine using the CUDA backend. Easy to deploy with Docker. Supports multi-batch generation and dynamic State switching. Let's spread RWKV, which combines RNN technology with impressively low inference costs!

License: Apache License 2.0


Languages

Python 99.6%, Shell 0.4%