kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Home Page: https://kvcache-ai.github.io/ktransformers/

Repository from Github: https://github.com/kvcache-ai/ktransformers

is this project dead?

devops724 opened this issue · comments

Hi, I have been waiting a long time for an update to support the new CUDA driver, but the last commit was around two months ago. Is there any alternative that lets me run Qwen 235B on 16G VRAM and 512G RAM on the latest Debian 13 with the latest CUDA and NVIDIA driver?

You can try ik_llama.cpp
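A hybrid launch along these lines should keep the MoE experts in your 512G of RAM and only the attention/dense layers in the 16G of VRAM. This is a rough sketch only: the model path, quant choice and context size are assumptions, and the flags mirror the ik_llama.cpp command shared later in this thread.

# rough sketch -- model path, quant and context size are placeholders
./build/bin/llama-server \
    --model /models/Qwen3-235B-A22B-Q4_K_M.gguf \
    --ctx-size 32768 \
    -fa -fmoe \
    -ctk q8_0 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --threads 32 \
    --host 0.0.0.0 --port 8080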

You can try ik_llama.cpp

I can't find a way to compile this using CUDA 12.8 on Linux either;
the RTX 50 series requires a minimum of CUDA 12.8.

You can try ik_llama.cpp

which does not use AMX in CPU+GPU hybrid mode unless something has changed in the past few weeks. They are still using the upstream engine from llama.cpp, which is hard-coded to use AVX512 when a GPU is present and will only use AMX in CPU-only operation.

You can try ik_llama.cpp

I can't find a way to compile this using CUDA 12.8 on Linux either; the RTX 50 series requires a minimum of CUDA 12.8.

I had no issues with Ubuntu 24.04 LTS, CUDA 12.8, and the open 575-server driver.
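For the RTX 50 cards specifically, targeting the Blackwell compute capability explicitly is usually what fixes a CUDA 12.8 build when the default architecture list predates the GPU. A sketch, with the "120" architecture value and the toolkit path as assumptions rather than anything verified here:

# build sketch for CUDA 12.8 + RTX 50; "120" (Blackwell) and the toolkit path are assumptions
export CUDA_HOME=/usr/local/cuda-12.8
export PATH="$CUDA_HOME/bin:$PATH"
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="120" \
  -DCMAKE_CUDA_COMPILER="$CUDA_HOME/bin/nvcc" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j "$(nproc)"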

True, ik doesn't have any specific AMX optimization, but going by the benchmark results from this repo (which may be outdated) I don't really see it outperforming ik in either prefill or generation. I wish I could be proven wrong, since I could personally benefit from it.

You can try ik_llama.cpp

which does not use AMX in CPU+GPU hybrid mode unless something has changed in the past few weeks. They are still using the upstream engine from llama.cpp, which is hard-coded to use AVX512 when a GPU is present and will only use AMX in CPU-only operation.

ik_llama does not use the "upstream inference engine". If it did, what would be the point of it? Although it does not use AMX-specific optimisations, it is still significantly faster than the llama.cpp and KTransformers AMX implementations, at least as far as I can tell.

Any plans to implement AMX then?

I am a big fan of ik_llama.cpp, but I have not run it in a long time.

I could try to do a run of both.

You can try ik_llama.cpp

which does not use AMX in CPU+GPU hybrid mode unless something has changed in the past few weeks. They are still using the upstream engine from llama.cpp, which is hard-coded to use AVX512 when a GPU is present and will only use AMX in CPU-only operation.

ik_llama does not use the "upstream inference engine". If it did, what would be the point of it? Although it does not use AMX-specific optimisations, it is still significantly faster than the llama.cpp and KTransformers AMX implementations, at least as far as I can tell.

I have a fork of llama.cpp that enables AMX in hybrid environments; it grants about a 20-30% increase in the CPU inference portion of offloaded models if you want to try it.

True, ik doesn't have any specific AMX optimization, but going by the benchmark results from this repo (which may be outdated) I don't really see it outperforming ik in either prefill or generation. I wish I could be proven wrong, since I could personally benefit from it.

I pulled the AMXInt4 working fork of Ktransformers and saw some meaningful improvements. Would love to see these be implemented in ik_llama too

#1492 (comment)

I have a fork of llama.cpp that enables AMX in hybrid environments; it grants about a 20-30% increase in the CPU inference portion of offloaded models if you want to try it.

True, ik doesn't have any specific AMX optimization, but going by the benchmark results from this repo (which may be outdated) I don't really see it outperforming ik in either prefill or generation. I wish I could be proven wrong, since I could personally benefit from it.

Hi, I tried your fork and the uplift wasn't noticeable.

prompt eval time = 1016.33 ms / 16 tokens ( 63.52 ms per token, 15.74 tokens per second)
eval time = 149323.75 ms / 1162 tokens ( 128.51 ms per token, 7.78 tokens per second)

I used the following flags, as AMX-specific flags aren't documented anywhere, but some posts mentioned the GGML_NATIVE flag being enough.

cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON

I also used an IQ4_XL quant. Would that have affected the results?
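(Before blaming the quant, it is worth confirming that the CPU actually exposes AMX and that the kernel enables it. The checks below are generic; the exact system_info wording llama.cpp prints at model load varies between builds, so treat that part as an assumption.)

# confirm the hardware/kernel side of AMX before looking at build flags
lscpu | grep -o 'amx_[a-z0-9]*' | sort -u   # expect amx_bf16, amx_int8, amx_tile
grep -c amx_tile /proc/cpuinfo              # 0 here means no kernel/hardware support
# at model load, llama.cpp prints a "system_info" line; whether an AMX entry
# shows up there depends on the build, so check that log line as well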

In any case, I think we should continue this conversation on that repo. Can you enable the discussions feature?

If you are running a hybrid CPU/GPU setup, try my llama.cpp fork; it works
with AMX int8/bf16 (I don't have a 6th gen to try int4).

https://github.com/Gadflyii/llama.cpp

Build as usual with all the AMX flags, and run llama-bench/cli/server with
"--amx" to enable AMX in a hybrid setup.

You should see a 30-40% uplift in the CPU-offloaded layers/experts.
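To make the comparison easy to reproduce, something like the following should do it. --amx is the fork's own switch as described above; the CMake options and model path are assumptions (GGML_NATIVE is what normally lets ggml pick up AMX on a supporting CPU).

git clone https://github.com/Gadflyii/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j "$(nproc)"
# run the same bench with your usual offload flags, then repeat with --amx appended
./build/bin/llama-bench -m /models/some-moe-quant.gguf
./build/bin/llama-bench -m /models/some-moe-quant.gguf --amx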

@devops724
Yes, this project is dead. The underlying fork of flashinfer that ktransformers is using doesn't work correctly. What's more, if you try to port the latest master of flashinfer to be usable with ktransformers, you will hit the same issues. ktransformers is an example of a project that should be avoided at all costs. Speed is nothing if the output is garbage.

What doesn’t work correctly?

I don't think it is fair to say the project is dead; they made a major commit just last week to add support for qwen-next.

@Gadflyii

What doesn’t work correctly?

The long context for any MoE model for sure.

[EDIT]: take a look over here: #1417 (comment)
Specifically, ktransformers MIGHT work for 10-20k context, but if you try to use it for a longer context it will fail in unpredictable ways. That's a fact. ktransformers just doesn't work, because it relies on flashinfer, which has so many bugs it's insane.

Long Context over ~80K seems to cause instabilities, but until then I have had positive experiences with this package. I am very much looking forward to the AMXInt4 backend, as the improvements are an order of magnitude above everything else in terms of PP. The only problem is the novel approach, which prevents devs like me from digging into the weeds of the project and keeping it improving.

@Gadflyii

I don’t think it is fair to say the project is dead, they made a major commit just last week to add support for qwen-next.

It doesn't matter what model this project supports. Its core functionality is unstable. And no one wants to fix it. The custom_flashinfer is a relic of the unstable flashinfer.

@trilog-inc

Long Context over ~80K seems to cause instabilities, but until then I have had positive experiences with this package.

You do know that even if [it] DOESN'T output "garbage", you CAN'T know for sure whether the results are valid. You know why? Because ktransformers DOESN'T support seeds. That is, EACH TIME you HAVE TO BELIEVE that the bug never occurred. I can't imagine how that could feel comfortable.
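For what it's worth, on a backend that does expose a seed, a crude reproducibility check is just the same request twice at temperature 0 and a diff of the outputs. The sketch below uses the stock llama-server OpenAI-compatible endpoint and assumes jq is installed; the prompt and port are placeholders.

REQ='{"model":"any","temperature":0,"seed":42,"max_tokens":256,
      "messages":[{"role":"user","content":"List the first 20 primes."}]}'
curl -s http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d "$REQ" > run1.json
curl -s http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d "$REQ" > run2.json
# identical output across runs is exactly the property ktransformers cannot currently guarantee
diff <(jq -r '.choices[0].message.content' run1.json) \
     <(jq -r '.choices[0].message.content' run2.json) && echo "deterministic"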

I get it. I've followed your adventure through this package and tend to agree. Just speaking on behalf of my experience.

Edit: I run ik_llama and @Gadflyii's llama.cpp AMX fork regularly because of these concerns, but I want to see ktransformers succeed as well.

@trilog-inc

I am very much looking forward to the AMXInt4 backend, as the improvements are an order of magnitude above everything else in terms of PP.

In terms of prompt processing (that is, the prefill?).
But how would AMX on the CPU increase the prefill? I get your point, it's always great to have some Intel Xeon QYFS for $150 from eBay. I am building a machine right now with 56C and it does support AMX, so I am interested too. But how does it compare to multiple GPUs? Isn't it easier to plug in more GPUs to increase the prefill? I am getting around 120 tps with ik_llama.cpp with a really heavy (6.2bpw) quant of DeepSeek R1, for example, with only three GPUs. Are you saying that some trick with AMX can push the prefill to 1000 tps? Are you sure?

I can only speak from my own experience as I don't have the hardware to test their best-case runs, but with a 24C W7-3455, 512GB DDR5 (4800), 1x 4090 and 1x 3090, I get a prefill speed of >140 T/s and decode of 12 T/s at 800 tokens, decreasing to about 120 T/s and 10.5 T/s at 30K tokens (AMXInt4 preview in the SOSP branch).
Their published results (https://github.com/kvcache-ai/ktransformers/blob/sosp25-ae/sosp25-ae/Figure11-prefill/full_run/reference_figure11.pdf) show them reaching ~400+ T/s.

Sure, additional GPUs can do it, but when you're dealing with these large MoE models it isn't practical (PSU limits and $$ per GB of memory). The prompt processing speed is limited by CPU memory bandwidth, so any edge in CPU-bound computation yields big improvements.

Can you share your ik_llama command? Does this work for the full 128K context? I have never been able to get much more than 40 T/s, and I have 4 GPUs. I also can't break much more than 10 t/s decode with ik at low context.

I hope they fix it; this is a really cool project.

@trilog-inc

Can you share your ik_llama command? Does this work for the full 128K context?

128k IS NOT the "full context". DeepSeek R1/V3 supports 160k. To handle that with decent batch sizes you need about 72GB of VRAM.
The command is here: ikawrakow/ik_llama.cpp#477 (reply in thread)
But that was some time ago. Now, with newer drivers, newer CUDA, and some CPU and GPU tweaks, I am getting slightly better results for the first 4k of context: specifically, 119.72 tps in prefill and 6.73 tps in decode (6.2bpw quant).

export MALLOC_CONF="background_thread:true,percpu_arena:phycpu,metadata_thp:auto,dirty_decay_ms:10000,muzzy_decay_ms:60000"
export LD_PRELOAD=/usr/local/lib/libjemalloc.so

#    --seed 3407 \
#    -fmoe \
      
ulimit -n 9999
ulimit -l unlimited

export CUDA_VISIBLE_DEVICES="0,1,2"
#export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server \
    --model /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/DeepSeek-R1-0528.ROOT-6.2478bpw/DeepSeek-R1-0528-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01148.gguf \
    --alias THIREUS/DeepSeek-R1-0528-6.2478bpw \
    --ctx-size $((160 * 1024)) \
    -b $((16 * 512)) -ub $((8 * 512)) \
    --mlock \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
    -ctk q8_0 \
    -mla 3 -fa \
    -fmoe \
    -amb 512 \
    --split-mode layer \
    --tensor-split 10,21,20 \
    --main-gpu 1 \
    --override-tensor exps=CPU \
    --n-gpu-layers 99 \
    --threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
    --host 0.0.0.0 \
    --port 8080 \
    --log-enable \
    --logdir /var/log/ \
    --jinja \
    --special \
    --verbose-prompt --verbosity 2 \
    --prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
    --slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
    --lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
    --keep -1 \
    --slot-prompt-similarity 0.35 \
    --metrics

I have never been able to get much more than 40T/s and I have 4 GPUs.

Well, you are doing something wrong then. Let's compare.

ik_llama.cpp:

#!/usr/bin/env bash
cd ik_llama.cpp
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES="86" \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=1 \
  -DGGML_SCHED_MAX_COPIES=1 \
  -DGGML_CUDA_IQK_FORCE_BF16=1 \
  -DGGML_MAX_CONTEXTS=2048 \
  -DGGML_VULKAN=OFF \
  -DGGML_CUDA_F16=ON
cmake --build build --config Release -j $(nproc)

nvidia-smi:

Tue Sep 23 19:56:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.09              Driver Version: 580.82.09      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:41:00.0 Off |                  N/A |
| 67%   78C    P2            186W /  400W |   23398MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:42:00.0 Off |                  N/A |
| 30%   59C    P2            168W /  400W |   23354MiB /  24576MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off |   00000000:61:00.0 Off |                  N/A |
| 63%   76C    P2            200W /  400W |   23626MiB /  24576MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

nvidia make.sh:

#!/usr/bin/env bash

apt install openssl

mkdir -p /lib/modules/$(uname -r)/build/certs
cd /lib/modules/$(uname -r)/build/certs

sudo tee x509.genkey > /dev/null << 'EOF'
[ req ]
default_bits = 4096
distinguished_name = req_distinguished_name
prompt = no
string_mask = utf8only
x509_extensions = myexts
[ req_distinguished_name ]
CN = Modules
[ myexts ]
basicConstraints=critical,CA:FALSE
keyUsage=digitalSignature
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid
EOF
openssl req -new -nodes -utf8 -sha512 -days 36500 -batch -x509 -config x509.genkey -outform DER -out signing_key.x509 -keyout signing_key.pem
ln -fs /lib/modules/$(uname -r)/build/certs /usr/src/linux-headers-$(uname -r | cut -d'-' -f1)-common/

cd -

apt -y install -f ./cuda-keyring_1.1-1_all.deb
apt -y update

#./cuda_13.0.0_580.65.06_linux.run
apt -y install cuda-13-0

apt install --reinstall -y cudnn libglx-nvidia0
./NVIDIA-Linux-x86_64-580.82.09.run
#ln -rs /usr/src/linux-headers-6.12.41+deb13-common/certs /usr/src/linux-headers-6.12.41+deb13-common/output
# https://github.com/aikitoria/open-gpu-kernel-modules
cd open-gpu-kernel-modules/
export IGNORE_CC_MISMATCH=1
./install.sh

also can't break much more than 10 t/s decode with ik at low context.

The decode speed purely depends on RAM bandwidth, so the only option is to use optimal quants. Check the graphs over here: ikawrakow/ik_llama.cpp#715

For example, I've been using this one: THIREUS-3.5652. So it's slightly more than 256GB in size, and with a Lenovo ThinkStation P620 (64C Threadripper PRO, DDR4-3200, two RTX 3090s) I am getting something very close to 10 tps in decode (actually, let me upgrade the software and retest). If you are using DDR5 you should get better results.
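As a sanity check on those numbers: the decode ceiling is roughly usable RAM bandwidth divided by the bytes of active weights streamed per token, and with 8-channel DDR4-3200 and a ~3.57 bpw quant of a ~37B-active-parameter model the arithmetic lands right where that ~10 tps sits. The figures below are assumptions taken from this thread and ignore whatever fraction of the weights lives in VRAM.

# back-of-envelope decode ceiling from RAM bandwidth
BW_GBS=$(echo "8 * 25.6" | bc)                    # 8-channel DDR4-3200, ~204.8 GB/s theoretical
GB_PER_TOK=$(echo "37 * 3.5652 / 8" | bc -l)      # ~16.5 GB of active weights read per token
echo "ceiling ~ $(echo "$BW_GBS / $GB_PER_TOK" | bc -l) tok/s"   # ~12.4 tok/s, so ~10 tps observed is plausible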

@trilog-inc

Their published results (https://github.com/kvcache-ai/ktransformers/blob/sosp25-ae/sosp25-ae/Figure11-prefill/full_run/reference_figure11.pdf) show them reaching ~400+ T/s.

The thing to keep in mind is that you can never be sure whether the results from ktransformers are valid. The numbers don't mean anything if you can never be sure what is true and what is false.

@magikRUKKOLA There is a lot to digest here. Thanks!
Simply using your command didn't yield any improvements, so I suspect it is mostly in the quants. I will dig into all that.

Check out https://github.com/guqiong96/lktransformers. There might be some additional improvements there.

Now I can run it on ARM: #1529

@johnnynunez

Now I can run it on ARM: #1529

How is the long-context support doing? Would any LLM still output garbage after 64k ctx with a probability of about 30-40%?