[Bug] Cutlass not enabled on convert_weight q4f16_ft
Erxl opened this issue · comments
Erxl commented
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
1. mlc_llm convert_weight llm/Mistral-Large-Instruct-2407 --quantization q4f16_ft -o m --device vulkan
(mlcllm) a@aserver:~$ mlc_llm convert_weight llm/Mistral-Large-Instruct-2407 --quantization q4f16_ft -o m --device vulkan
[2024-08-25 12:39:16] INFO auto_config.py:116: Found model configuration: llm/Mistral-Large-Instruct-2407/config.json
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:0
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:1
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:2
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:3
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:4
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:5
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:6
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:7
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:8
[2024-08-25 12:39:24] INFO auto_weight.py:71: Finding weights in: llm/Mistral-Large-Instruct-2407
[2024-08-25 12:39:24] INFO auto_weight.py:137: Not found Huggingface PyTorch
[2024-08-25 12:39:24] INFO auto_weight.py:144: Found source weight format: huggingface-safetensor. Source configuration: llm/Mistral-Large-Instruct-2407/model.safetensors.index.json
[2024-08-25 12:39:24] INFO auto_weight.py:107: Using source weight configuration: llm/Mistral-Large-Instruct-2407/model.safetensors.index.json. Use `--source` to override.
[2024-08-25 12:39:24] INFO auto_weight.py:111: Using source weight format: huggingface-safetensor. Use `--source-format` to override.
[2024-08-25 12:39:24] INFO auto_config.py:154: Found model type: mistral. Use `--model-type` to override.
Weight conversion with arguments:
--config llm/Mistral-Large-Instruct-2407/config.json
--quantization FTQuantize(name='q4f16_ft', kind='ft-quant', quantize_dtype='int4', storage_dtype='int8', model_dtype='float16', group_size=None, num_elem_per_
--model-type mistral
--device vulkan:0
--source llm/Mistral-Large-Instruct-2407/model.safetensors.index.json
--source-format huggingface-safetensor
--output m
[2024-08-25 12:39:24] INFO mistral_model.py:59: context_window_size not found in config.json. Falling back to max_position_embeddings (131072)
[2024-08-25 12:39:24] INFO mistral_model.py:87: prefill_chunk_size defaults to 2048
[2024-08-25 12:39:24] INFO ft_quantization.py:140: Fallback to GroupQuantize for nn.Linear: "lm_head", weight.shape: [vocab_size, 12288], out_dtype: None
Start storing to cache m
[2024-08-25 12:39:30] INFO huggingface_loader.py:185: Loading HF parameters from: llm/Mistral-Large-Instruct-2407/model-00051-of-00051.safetensors
[2024-08-25 12:39:34] INFO group_quantization.py:218: Compiling quantize function for key: ((32768, 12288), float16, vulkan, axis=1, output_transpose=False)
[2024-08-25 12:39:34] INFO huggingface_loader.py:167: [Quantized] Parameter: "lm_head.q_weight", shape: (32768, 1536), dtype: uint32
[2024-08-25 12:39:35] INFO huggingface_loader.py:167: [Quantized] Parameter: "lm_head.q_scale", shape: (32768, 384), dtype: float16
[2024-08-25 12:39:35] INFO huggingface_loader.py:175: [Not quantized] Parameter: "model.layers.87.input_layernorm.weight", shape: (12288,), dtype: float16
0%|
Traceback (most recent call last):
File "/home/a/miniconda3/envs/mlcllm/bin/mlc_llm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/__main__.py", line 37, in main
cli.main(sys.argv[2:])
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/cli/convert_weight.py", line 88, in main
convert_weight(
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/interface/convert_weight.py", line 181, in convert_weight
_convert_args(args)
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/interface/convert_weight.py", line 145, in _convert_args
tvmjs.dump_ndarray_cache(
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/tvm/contrib/tvmjs.py", line 273, in dump_ndarray_cache
for k, origin_v in param_generator:
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/interface/convert_weight.py", line 129, in _param_generator
for name, param in loader.load(device=args.device, preshard_funcs=preshard_funcs):
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/loader/huggingface_loader.py", line 121, in load
for name, loader_param in self._load_or_quantize(mlc_name, param, device):
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/loader/huggingface_loader.py", line 164, in _load_or_quantize
q_params = self.quantize_param_map.map_func[mlc_name](param)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/quantization/ft_quantization.py", line 180, in quantize_weight
assert tvm.get_global_func("relax.ext.cutlass", True), (
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Cutlass should be enabled in TVM runtime to quantize weight, but not enabled in current TVM runtime environment. To enable Cutlass in TVM runtime, set `USE_CUTLASS` to ON in config.cmake when compiling TVM from source
Expected behavior
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCm 6.2
- Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
- Device: 7900 XTX
- How you installed MLC-LLM (conda, source): Python prebuilt package
- How you installed TVM-Unity (pip, source):
- Python version (e.g. 3.10): 3.11
- GPU driver version (if applicable):
- CUDA/cuDNN version (if applicable):
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models):
- Any other relevant information:
Additional context
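A quick way to check whether the installed TVM runtime was built with Cutlass support is to probe for the same global function that the failing assertion in ft_quantization.py looks up. This is a sketch, not part of the original report; it simply reuses the `relax.ext.cutlass` lookup shown in the traceback above:

```python
# Check whether the installed TVM runtime was built with Cutlass support.
# Mirrors the assertion in mlc_llm/quantization/ft_quantization.py: the
# quantizer requires the "relax.ext.cutlass" global function to exist.
import importlib.util


def cutlass_enabled() -> bool:
    """Return True only if tvm is importable and Cutlass was compiled in."""
    if importlib.util.find_spec("tvm") is None:
        return False
    import tvm

    # allow_missing=True makes get_global_func return None instead of raising.
    return tvm.get_global_func("relax.ext.cutlass", True) is not None


print("Cutlass enabled:", cutlass_enabled())
```

If this prints `False` on a prebuilt pip/conda package, the q4f16_ft path cannot work without rebuilding TVM with Cutlass enabled.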
Ruihang Lai commented
Hi @Erxl, the q4f16_ft quantization is only available when you use --device cuda, so you will likely need to try again with that. If you are not using NVIDIA GPUs, the FasterTransformer quantization is unfortunately not available.
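The two workarounds implied by this reply can be sketched as shell commands. The model path and output directory mirror the original report; using q4f16_1 (plain group quantization, which the log above already falls back to for lm_head) as the non-Cutlass alternative is a suggestion, not something stated in this thread. The commands are assigned to variables and echoed rather than executed, since they depend on local hardware:

```shell
# 1) On an NVIDIA GPU with a CUDA-enabled TVM runtime, q4f16_ft is supported:
cuda_cmd="mlc_llm convert_weight llm/Mistral-Large-Instruct-2407 --quantization q4f16_ft -o m --device cuda"

# 2) On AMD/Vulkan hardware, pick a quantization that does not depend on
#    FasterTransformer/Cutlass kernels, e.g. group quantization (q4f16_1):
vulkan_cmd="mlc_llm convert_weight llm/Mistral-Large-Instruct-2407 --quantization q4f16_1 -o m --device vulkan"

echo "$cuda_cmd"
echo "$vulkan_cmd"
```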
Erxl commented
@MasterJH5574 Is q4f16_ft inference available on AMD or Vulkan?
Ruihang Lai commented
Is q4f16_ft inference available on AMD or Vulkan?
@Erxl No, FasterTransformer is developed by NVIDIA: https://github.com/NVIDIA/FasterTransformer