qwopqwop200 / GPTQ-for-LLaMa

4 bits quantization of LLaMA using GPTQ

Syntax changed in triton.testing.do_bench() causing error when running llama_inference.py

prasanna opened this issue

Got this error when running llama_inference.py:

$ CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
  0%|                                                                                                             | 0/12 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 72, in _bench
    return triton.testing.do_bench(kernel_call, percentiles=(0.5, 0.2, 0.8), rep=40)
TypeError: do_bench() got an unexpected keyword argument 'percentiles'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/llama_inference.py", line 110, in <module>
    model = load_quant(args.model, args.load, args.wbits, args.groupsize, fused_mlp=args.fused_mlp)
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/llama_inference.py", line 66, in load_quant
    quant.autotune_warmup_linear(model, transpose=not (eval))
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/quant_linear.py", line 419, in autotune_warmup_linear
    matmul248(a, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/quant_linear.py", line 267, in matmul248
    matmul_248_kernel[grid](input, qweight, output, scales, qzeros, g_idx, input.shape[0], qweight.shape[1], input.shape[1], bits, maxq, input.stride(0), input.stride(1), qweight.stride(0),
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 73, in _bench
    except triton.compiler.OutOfResources:
AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'

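The root cause is at quant/custom_autotune.py:72: the percentiles parameter of triton.testing.do_bench() has been renamed to quantiles, so the call raises the TypeError shown first. That TypeError is then evaluated against the except clause on line 73, which references triton.compiler.OutOfResources. That attribute does not exist in this Triton version either, so the AttributeError is the error that finally gets reported.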
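
A possible workaround, sketched below, is to pick the keyword at runtime instead of hard-coding it. This is not the repository's fix: call_do_bench and find_exception are hypothetical helper names, and the alternative module paths for OutOfResources mentioned in the comments are guesses that should be checked against the installed Triton version.

```python
import inspect


def call_do_bench(do_bench, kernel_call, points=(0.5, 0.2, 0.8), rep=40):
    """Call a do_bench-like function with whichever keyword it accepts.

    `do_bench` is passed in (e.g. triton.testing.do_bench) so that this
    helper does not itself depend on Triton being importable.
    """
    params = inspect.signature(do_bench).parameters
    if "quantiles" in params:        # newer keyword, per this issue
        return do_bench(kernel_call, quantiles=points, rep=rep)
    if "percentiles" in params:      # older keyword used by custom_autotune.py
        return do_bench(kernel_call, percentiles=points, rep=rep)
    return do_bench(kernel_call, rep=rep)  # neither keyword: rely on defaults


def find_exception(root, dotted_paths, default=Exception):
    """Return the first exception class found under `root` at one of the
    dotted attribute paths, or `default` if none of them resolves."""
    for path in dotted_paths:
        obj = root
        try:
            for name in path.split("."):
                obj = getattr(obj, name)
        except AttributeError:
            continue  # this path does not exist in the installed version
        if isinstance(obj, type) and issubclass(obj, BaseException):
            return obj
    return default
```

In quant/custom_autotune.py this could be used roughly as follows (untested against the repository; the second and third paths are assumptions about where the class might live in other Triton releases):

    OutOfResources = find_exception(triton, ["compiler.OutOfResources", "runtime.autotuner.OutOfResources", "runtime.errors.OutOfResources"])
    try:
        return call_do_bench(triton.testing.do_bench, kernel_call)
    except OutOfResources:
        ...

Note that the default of Exception makes the except clause very broad when no path resolves, so passing a narrower default may be preferable.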