mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation

Home Page: https://llm.mlc.ai/

[Bug] Cutlass not enabled on convert_weight q4f16_ft

Erxl opened this issue · comments

commented

πŸ› Bug

To Reproduce

Steps to reproduce the behavior:

1. mlc_llm convert_weight llm/Mistral-Large-Instruct-2407 --quantization q4f16_ft -o m --device vulkan

(mlcllm) a@aserver:~$ mlc_llm convert_weight llm/Mistral-Large-Instruct-2407 --quantization q4f16_ft -o m --device vulkan
[2024-08-25 12:39:16] INFO auto_config.py:116: Found model configuration: llm/Mistral-Large-Instruct-2407/config.json
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:0
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:1
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:2
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:3
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:4
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:5
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:6
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:7
[2024-08-25 12:39:24] INFO auto_device.py:79: Found device: vulkan:8
[2024-08-25 12:39:24] INFO auto_weight.py:71: Finding weights in: llm/Mistral-Large-Instruct-2407
[2024-08-25 12:39:24] INFO auto_weight.py:137: Not found Huggingface PyTorch
[2024-08-25 12:39:24] INFO auto_weight.py:144: Found source weight format: huggingface-safetensor. Source configuration: llm/Mistral-Large-Instruct-2407/model.safetensors.index.json
[2024-08-25 12:39:24] INFO auto_weight.py:107: Using source weight configuration: llm/Mistral-Large-Instruct-2407/model.safetensors.index.json. Use `--source` to override.
[2024-08-25 12:39:24] INFO auto_weight.py:111: Using source weight format: huggingface-safetensor. Use `--source-format` to override.
[2024-08-25 12:39:24] INFO auto_config.py:154: Found model type: mistral. Use `--model-type` to override.
Weight conversion with arguments:
  --config          llm/Mistral-Large-Instruct-2407/config.json
  --quantization    FTQuantize(name='q4f16_ft', kind='ft-quant', quantize_dtype='int4', storage_dtype='int8', model_dtype='float16', group_size=None, num_elem_per_
  --model-type      mistral
  --device          vulkan:0
  --source          llm/Mistral-Large-Instruct-2407/model.safetensors.index.json
  --source-format   huggingface-safetensor
  --output          m
[2024-08-25 12:39:24] INFO mistral_model.py:59: context_window_size not found in config.json. Falling back to max_position_embeddings (131072)
[2024-08-25 12:39:24] INFO mistral_model.py:87: prefill_chunk_size defaults to 2048
[2024-08-25 12:39:24] INFO ft_quantization.py:140: Fallback to GroupQuantize for nn.Linear: "lm_head", weight.shape: [vocab_size, 12288], out_dtype: None
Start storing to cache m
[2024-08-25 12:39:30] INFO huggingface_loader.py:185: Loading HF parameters from: llm/Mistral-Large-Instruct-2407/model-00051-of-00051.safetensors                 
[2024-08-25 12:39:34] INFO group_quantization.py:218: Compiling quantize function for key: ((32768, 12288), float16, vulkan, axis=1, output_transpose=False)       
[2024-08-25 12:39:34] INFO huggingface_loader.py:167: [Quantized] Parameter: "lm_head.q_weight", shape: (32768, 1536), dtype: uint32                               
[2024-08-25 12:39:35] INFO huggingface_loader.py:167: [Quantized] Parameter: "lm_head.q_scale", shape: (32768, 384), dtype: float16                                
[2024-08-25 12:39:35] INFO huggingface_loader.py:175: [Not quantized] Parameter: "model.layers.87.input_layernorm.weight", shape: (12288,), dtype: float16         
  0%|β–Œ                                                                                                                                                             
Traceback (most recent call last):
  File "/home/a/miniconda3/envs/mlcllm/bin/mlc_llm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/__main__.py", line 37, in main
    cli.main(sys.argv[2:])
  File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/cli/convert_weight.py", line 88, in main
    convert_weight(
  File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/interface/convert_weight.py", line 181, in convert_weight
    _convert_args(args)
  File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/interface/convert_weight.py", line 145, in _convert_args
    tvmjs.dump_ndarray_cache(
  File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/tvm/contrib/tvmjs.py", line 273, in dump_ndarray_cache
    for k, origin_v in param_generator:
  File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/interface/convert_weight.py", line 129, in _param_generator
    for name, param in loader.load(device=args.device, preshard_funcs=preshard_funcs):
  File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/loader/huggingface_loader.py", line 121, in load
    for name, loader_param in self._load_or_quantize(mlc_name, param, device):
  File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/loader/huggingface_loader.py", line 164, in _load_or_quantize
    q_params = self.quantize_param_map.map_func[mlc_name](param)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/mlc_llm/quantization/ft_quantization.py", line 180, in quantize_weight
    assert tvm.get_global_func("relax.ext.cutlass", True), (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Cutlass should be enabled in TVM runtime to quantize weight, but not enabled in current TVM runtime environment. To enable Cutlass in TVM runtime, set USE_CUTLASS in config.cmake when compiling TVM from source
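
For reference, a minimal sketch (not from the original report) of how to check whether the installed TVM runtime was built with Cutlass, using the same `relax.ext.cutlass` lookup the assertion performs plus the `tvm.support.libinfo()` call from the Environment template below:

import tvm

# The assertion in ft_quantization.py looks up this packed function;
# it is only registered when TVM was built with Cutlass support.
has_cutlass = tvm.get_global_func("relax.ext.cutlass", allow_missing=True) is not None
print("relax.ext.cutlass registered:", has_cutlass)

# The build-time flag is also visible in the library info.
print("USE_CUTLASS:", tvm.support.libinfo().get("USE_CUTLASS"))

On a TVM Unity build without Cutlass (such as the prebuilt Vulkan/ROCm package used here), this would report the function as missing, which matches the assertion failure above.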

Expected behavior

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCm 6.2
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
  • Device: 7900 XTX
  • How you installed MLC-LLM (conda, source): Python prebuilt package
  • How you installed TVM-Unity (pip, source):
  • Python version (e.g. 3.10):3.11
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
  • Any other relevant information:

Additional context

Hi @Erxl, the q4f16_ft quantization is only available when you use --device cuda, so you likely need to try again with that. If you are not using NVIDIA GPUs, then unfortunately the FasterTransformer quantization is not available.
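
If you want to keep converting on Vulkan/ROCm, a possible workaround (a sketch, not something suggested in this thread) is to use a quantization mode that does not rely on Cutlass, such as the group quantization q4f16_1 mode that the log above already falls back to for lm_head:

mlc_llm convert_weight llm/Mistral-Large-Instruct-2407 --quantization q4f16_1 -o m --device vulkan

q4f16_ft and q4f16_1 are different 4-bit schemes, so speed and output quality may differ between them.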

commented

@MasterJH5574 Is q4f16_ft inference available on AMD or Vulkan?

@Erxl No, FasterTransformer is developed by NVIDIA (https://github.com/NVIDIA/FasterTransformer), so it is not available on AMD or Vulkan.