pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

GPTQ quantization not working

lopuhin opened this issue · comments

Running quantize.py with --mode int4-gptq does not seem to work:

  • the code tries to import lm-evaluation-harness, which is not included, documented, or used elsewhere
  • the import in eval.py is incorrect; it should probably be from model import Transformer as LLaMA instead of from model import LLaMA
  • after fixing the two issues above, the next one is a circular import
  • after fixing that, import lm_eval should be replaced with import lm_eval.base
  • there is one other circular import
  • there are a few other missing imports from lm_eval
  • and a few other errors

Overall here are the fixes I had to apply to make it run: lopuhin@86d990b
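
In case it helps, a rough sketch of the import-side fixes is below (the lazy-import helper is just an illustration of the workaround for the circular imports, not code from the commit, and the exact modules in the cycle may differ):

# eval.py imported a class that model.py does not define; alias Transformer instead:
from model import Transformer as LLaMA

# the GPTQ input recorder uses lm_eval's base model class, so the bare
# "import lm_eval" needs to be "import lm_eval.base" (lm_eval 0.3.x layout):
import lm_eval.base

# the circular imports were worked around by deferring the offending imports
# into the functions that need them, roughly like this (illustrative helper):
def _load_gptq_runner():
    from GPTQ import GenericGPTQRunner, InputRecorder  # imported lazily to break the cycle
    return GenericGPTQRunner, InputRecorder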

Based on this, could you please check if the right version of the code was included for GPTQ quantization?

One more issue is very high memory usage: it exceeds 128 GB after processing only the first 9 layers of the 13b model.

I am at the third bullet point here as well; going to just follow along with the comments here.

@jamestwhedbee to get rid of those Python issues you can try this fork in the meantime: https://github.com/lopuhin/gpt-fast/ -- but I don't have a solution for the high RAM usage yet, so in the end I didn't manage to get a converted model.

That looked promising, but I unfortunately ran into another issue you probably wouldn't have. I am on AMD, so that might be the cause; I can't find anything online related to this issue. I noticed that non-GPTQ int4 quantization does not work for me either, with the same error. int8 quantization works fine, and I have run GPTQ int4-quantized models using the auto-gptq library for ROCm before, so I'm not sure what this issue is.

Traceback (most recent call last):
  File "/home/telnyxuser/gpt-fast/quantize.py", line 614, in <module>
    quantize(args.checkpoint_path, args.model_name, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "/home/telnyxuser/gpt-fast/quantize.py", line 560, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/home/telnyxuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/telnyxuser/gpt-fast/quantize.py", line 423, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "/home/telnyxuser/gpt-fast/quantize.py", line 358, in prepare_int4_weight_and_scales_and_zeros
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
  File "/home/telnyxuser/.local/lib/python3.10/site-packages/torch/_ops.py", line 753, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.

I got the same error when trying a conversion on another machine with more RAM but an older NVIDIA GPU.

Has anyone solved all of these problems? I am getting every problem discussed in this thread.

@jamestwhedbee @lopuhin I am stuck on this:
Traceback (most recent call last):
  File "quantize.py", line 614, in <module>
    quantize(args.checkpoint_path, args.model_name, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "quantize.py", line 560, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/root/development/dev/venv/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "quantize.py", line 423, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "quantize.py", line 358, in prepare_int4_weight_and_scales_and_zeros
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
  File "/root/development/dev/venv/lib/python3.8/site-packages/torch/_ops.py", line 753, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.

Were you able to solve this?

RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.

@MrD005 I got this error when trying to run on a 2080 Ti but not on an L4 (both using CUDA 12.1), so I suspect this function is missing for lower compute capabilities.

@lopuhin I am running it on an A100 with Python 3.8 and the CUDA 11.8 nightly, so I think it is not about lower compute capability.

According to the code here, probably both CUDA 12.x and compute capability 8.0+ are required.
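
A quick sanity check along those lines (just a sketch; the 12.x and 8.0 thresholds are taken from the comment above, not something I verified against the kernel source):

import torch

# Rough environment check for the int4 packing kernel: the thread suggests it needs
# a CUDA 12.x build of PyTorch and a GPU with compute capability >= 8.0.
print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)  # e.g. '12.1'; None on CPU-only or ROCm builds
if torch.cuda.is_available() and torch.version.cuda is not None:
    major, minor = torch.cuda.get_device_capability()
    print("compute capability:", f"{major}.{minor}")
    if int(torch.version.cuda.split(".")[0]) < 12 or (major, minor) < (8, 0):
        print("_convert_weight_to_int4pack will likely fail on this setup")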

I had the same _convert_weight_to_int4pack_cuda not available problem. It was due to CUDA 11.8 not supporting the operator. It works now with an RTX 4090 and CUDA 12.1.

I got this problem on my single RTX 4090 with the PyTorch nightly installed for CUDA 11.8. After I switched to the PyTorch nightly for CUDA 12.1, the problem was gone.

@jamestwhedbee did you find a solution for ROCm?

@lufixSch no, but as of last week v0.2.7 of vLLM supports GPTQ with ROCm, and I am seeing pretty good results there. So maybe that is an option for you.
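
In case it is useful, the vLLM side is roughly this (a sketch; the model id is just a placeholder for whatever GPTQ-quantized checkpoint you actually use):

from vllm import LLM, SamplingParams

# Rough sketch of serving a GPTQ checkpoint with vLLM >= 0.2.7; substitute the
# placeholder model id with the GPTQ-quantized checkpoint you want to run.
llm = LLM(model="TheBloke/Llama-2-13B-GPTQ", quantization="gptq")
params = SamplingParams(temperature=0.8, max_tokens=128)
for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)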

I applied all the fixes mentioned, but I'm still getting this error:
File "/kaggle/working/quantize.py", line 14, in
from GPTQ import GenericGPTQRunner, InputRecorder
File "/kaggle/working/GPTQ.py", line 12, in
from eval import setup_cache_padded_seq_input_pos_max_seq_length_for_prefill
File "/kaggle/working/eval.py", line 20, in
import lm_eval.base
ModuleNotFoundError: No module named 'lm_eval.base'

I am using lm_eval 0.4.0

Support for lm_eval 0.3.0 and 0.4.0 was added in eb1789b.
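
For anyone on an older checkout, a compatibility guard along these lines should handle both layouts (just a sketch, not the actual eb1789b change; it assumes 0.3.x exposes lm_eval.base.BaseLM while 0.4.x moved the base class to lm_eval.api.model.LM, and the EvalWrapperBase alias is only illustrative):

try:
    # lm_eval 0.3.x: the base LM class lives in lm_eval.base
    from lm_eval.base import BaseLM as EvalWrapperBase
except ImportError:
    # lm_eval 0.4.x: lm_eval.base was removed; the base class moved here
    from lm_eval.api.model import LM as EvalWrapperBase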