kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Home Page: https://kvcache-ai.github.io/ktransformers/

Repository from Github: https://github.com/kvcache-ai/ktransformers

[Bug] in support-qwen3next branch, the transformers version is not high enough

PPXGS opened this issue

Checklist

  • 1. I have searched for related issues but did not get the help I expected
  • 2. The bug has not been fixed in the latest version
  • 3. Please note that if a bug report lacks the corresponding environment information and a minimal reproducible example, it will be hard for us to reproduce and locate the problem, lowering the chance of getting feedback
  • 4. If what you are raising is a question rather than a bug, please start a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise this issue will be closed
  • 5. To make community communication easier, I will use Chinese/English or attach a Chinese/English translation (if using another language). Non-Chinese/English content without a translation may be closed

Problem Description

Building ktransformers on the support-qwen3next branch succeeds, but the installed transformers version does not match the version the Qwen3-Next models require.

Reproduction Steps

After building, the ktransformers version is 0.3.2+cu124torch26fancy and the transformers version is 4.51.3, but Qwen3-Next-80B-A3B-Instruct requires "transformers_version": "4.57.0.dev0" (from its config.json).
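
As a quick cross-check, the required version can be read straight from the model's config.json and compared with the installed package. This is a minimal sketch, not part of ktransformers: the model path is the one used in the serve command below, and packaging is assumed available (it ships as a transformers dependency).

import json
from packaging import version
import transformers

# Path taken from the serve command below; adjust as needed.
model_path = "/localnvme/application/common/models/Qwen/Qwen3-Next-80B-A3B-Instruct"

with open(f"{model_path}/config.json") as f:
    required = json.load(f)["transformers_version"]   # "4.57.0.dev0" for this model

installed = transformers.__version__                  # "4.51.3" in this environment
print(f"required {required}, installed {installed}")
print("compatible:", version.parse(installed) >= version.parse(required))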
When I run:

python ktransformers/server/main.py \
    --port 10021 \
    --model_path /localnvme/application/common/models/Qwen/Qwen3-Next-80B-A3B-Instruct \
    --model_name Qwen3NextForCausalLM \
    --optimize_config_path /localnvme/application/zhangzn/ktransformers_v0.3.2/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Next-serve.yaml \
    --max_new_tokens 1024 \
    --cache_lens 32768 \
    --chunk_size 256 \
    --max_batch_size 4 \
    --no-use_cuda_graph \
    --backend_type balance_serve

I get the error:
ImportError: cannot import name 'layer_type_validation' from 'transformers.configuration_utils' (/localnvme/application/zhangzn/anaconda3/envs/ktransformers_support-qwen3next/lib/python3.11/site-packages/transformers/configuration_utils.py)
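
The missing name, layer_type_validation, only exists in transformers releases newer than 4.51.3 (the git dev build used below does expose it), so the failing import doubles as a quick compatibility probe. A sketch, not a fix:

# Probe whether the installed transformers is new enough for Qwen3-Next configs.
try:
    from transformers.configuration_utils import layer_type_validation  # noqa: F401
    print("transformers exposes layer_type_validation; Qwen3-Next configs should load")
except ImportError:
    print("transformers is too old; upgrade before loading Qwen3-Next")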


Environment Information

ktransformers 0.3.2+cu124torch26fancy
transformers 4.51.3
cuda 12.4
python 3.11
Ubuntu 20.04
GPU NVIDIA A800 ×8

pip uninstall transformers -y
pip install git+https://github.com/huggingface/transformers.git

python ktransformers/server/main.py --port 10021 --model_path /root/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic --gguf_path /root/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic --model_name Qwen3NextForCausalLM --backend_type balance_serve
W0922 09:30:59.383000 462160 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0922 09:30:59.383000 462160 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-09-22 09:30:59,385 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
flash_attn not found, flashinfer unit test needed it. If you are using balance serve, ignore this.
set start method
Connected to server at tcp://localhost:37793
W0922 09:31:07.075000 462268 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0922 09:31:07.075000 462268 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-09-22 09:31:07,078 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
flash_attn not found, flashinfer unit test needed it. If you are using balance serve, ignore this.
start method already set to spawn
Connected to server at tcp://localhost:37793
args.architectures: Qwen3NextForCausalLM
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d
Injecting model as default
Injecting model.embed_tokens as default
......
Injecting model.layers.47 as default
Injecting model.layers.47.self_attn as ktransformers.operators.balance_serve_attention . KQwen3NextAttention
Injecting model.layers.47.self_attn.q_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.47.self_attn.k_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.47.self_attn.v_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.47.self_attn.o_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.47.self_attn.q_norm as ktransformers.operators.layernorm . KQwen3NextRMSNorm
Injecting model.layers.47.self_attn.k_norm as ktransformers.operators.layernorm . KQwen3NextRMSNorm
Injecting model.layers.47.mlp as ktransformers.operators.experts . KQwen3NextSparseMoeBlockV2
Injecting model.layers.47.mlp.gate as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.47.mlp.experts as ktransformers.operators.experts . KTransformersExpertsV2
Injecting model.layers.47.mlp.shared_expert as ktransformers.operators.mlp . KQwen2MoeMLP
Injecting model.layers.47.mlp.shared_expert.gate_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.47.mlp.shared_expert.up_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.47.mlp.shared_expert.down_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.47.mlp.shared_expert.act_fn as default
Injecting model.layers.47.mlp.shared_expert_gate as default
Injecting model.layers.47.input_layernorm as ktransformers.operators.layernorm . KQwen3NextRMSNorm
Injecting model.layers.47.post_attention_layernorm as ktransformers.operators.layernorm . KQwen3NextRMSNorm
Injecting model.norm as ktransformers.operators.layernorm . KQwen3NextRMSNorm
Injecting model.rotary_emb as ktransformers.operators.RoPE . KQwen3MoeRotaryEmbedding
Injecting cache as default
Injecting lm_head as ktransformers.operators.linear . KTransformersLinear
loading model.embed_tokens.weight to cpu
loading model.layers.0.linear_attn.dt_bias to cuda
loading model.layers.0.linear_attn.A_log to cuda
loading model.layers.0.linear_attn.conv1d.weight to cuda:0
Process SpawnProcess-1:
Traceback (most recent call last):
File "/root/anaconda3/envs/ktransformers/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/anaconda3/envs/ktransformers/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 308, in run_engine
engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 212, in init
optimize_and_load_gguf(self.model, optimize_config_path, gguf_path, config)
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/optimize/optimize.py", line 131, in optimize_and_load_gguf
load_weights(module, weights_loader, device=default_device)
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 174, in load_weights
load_weights(child, gguf_loader, prefix+name+".", device=device)
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 174, in load_weights
load_weights(child, gguf_loader, prefix+name+".", device=device)
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 174, in load_weights
load_weights(child, gguf_loader, prefix+name+".", device=device)
[Previous line repeated 1 more time]
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 176, in load_weights
module.load()
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/base_operator.py", line 63, in load
utils.load_weights(child, self.gguf_loader, self.key+".")
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 174, in load_weights
load_weights(child, gguf_loader, prefix+name+".", device=device)
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 176, in load_weights
module.load()
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/linear.py", line 944, in load
self.generate_linear.load(w=w)
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/linear.py", line 653, in load
marlin_q_w, marlin_s, g_idx, sort_indices, _ = marlin_quantize(
^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/marlin_utils.py", line 93, in marlin_quantize
q_w, s, g_idx, rand_perm = quantize_weights(w, num_bits, group_size,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/quant_utils.py", line 61, in quantize_weights
s = torch.max(torch.abs(w), 0, keepdim=True)[0]
^^^^^^^^^^^^
NotImplementedError: "abs_cuda" not implemented for 'Float8_e4m3fn'
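
The crash is in the Marlin quantization helper: torch.abs has no CUDA kernel for Float8_e4m3fn, so any FP8 weight reaching quantize_weights fails. Below is a minimal standalone repro plus the obvious upcast workaround; it is a sketch only, does not patch ktransformers, and the real question may be whether the Marlin path should receive FP8 weights at all.

import torch

# Minimal repro of the kernel gap (requires a CUDA device and PyTorch >= 2.1).
w = torch.randn(64, 64, device="cuda").to(torch.float8_e4m3fn)

# torch.max(torch.abs(w), 0, keepdim=True)   # NotImplementedError: "abs_cuda" not implemented

# Upcasting before the reduction sidesteps the missing FP8 kernel:
s = torch.max(torch.abs(w.to(torch.float32)), 0, keepdim=True)[0]
print(s.shape)   # torch.Size([1, 64])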

In that case I'm not sure. All I know is that loading works with the dev-version transformers dependency; I also tried other, lower transformers versions, and they could not even load the model.

Have you managed to run it yet? The original Qwen3-Next works for me, but the https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 version fails with an error:

Injecting lm_head as ktransformers.operators.linear . KTransformersLinear
loading model.embed_tokens.weight to cpu
loading model.layers.0.linear_attn.dt_bias to cuda
loading model.layers.0.linear_attn.A_log to cuda
loading model.layers.0.linear_attn.conv1d.weight to cuda:0
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 308, in run_engine
    engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 212, in __init__
    optimize_and_load_gguf(self.model, optimize_config_path, gguf_path, config)
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/optimize/optimize.py", line 131, in optimize_and_load_gguf
    load_weights(module, weights_loader, device=default_device)
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  [Previous line repeated 1 more time]
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 176, in load_weights
    module.load()
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/base_operator.py", line 63, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/utils.py", line 176, in load_weights
    module.load()
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/linear.py", line 944, in load
    self.generate_linear.load(w=w)
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/linear.py", line 638, in load
    self.bias = w[1].view(self.orin_out_features)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[12288]' is invalid for input of size 1536
^CTraceback (most recent call last):
  File "/home/ubuntu/ktransformers/ktransformers/server/main.py", line 122, in <module>
    main()
  File "/home/ubuntu/ktransformers/ktransformers/server/main.py", line 109, in main
    create_interface(config=cfg, default_args=cfg)
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/utils/create_interface.py", line 30, in create_interface
    GlobalInterface.interface = BackendInterface(default_args)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 350, in __init__
    kvcache_event.wait()
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/multiprocessing/synchronize.py", line 356, in wait
    self._cond.wait(timeout)
  File "/home/ubuntu/miniconda3/envs/ktransformers/lib/python3.11/multiprocessing/synchronize.py", line 268, in wait
    return self._wait_semaphore.acquire(True, timeout)
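
For what it's worth, the view fails because the element counts differ by exactly 8x (12288 = 8 * 1536). One guess, not verified, is that the loader is picking up a smaller auxiliary tensor from the FP8-Dynamic checkpoint (such as a quantization scale) where it expects a full bias vector. The error class itself is easy to reproduce:

import torch

t = torch.zeros(1536)
try:
    t.view(12288)   # view requires the element count to match
except RuntimeError as e:
    print(e)        # shape '[12288]' is invalid for input of size 1536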

@harveyff The original version runs extremely slowly for me, it barely makes progress. Does it run normally on your side?