qwopqwop200 / GPTQ-for-LLaMa

4 bits quantization of LLaMA using GPTQ

CUDA out of memory on flan-ul2

sigmareaver opened this issue · comments

Tested on an RTX 4090, using the command:
python t5.py ../full-models/flan-ul2 c4 --wbits 4 --act-order --groupsize 128 --save ../gptq-models/flan-ul2-gptq/flan-ul2-4bit-128g-gptq.pt
What is the memory requirement for quantizing a 20B model? I thought it should only need one layer at a time on the GPU?
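My mental model of the sequential pass is roughly the sketch below: the whole fp16 model stays on the CPU and only the block currently being quantized is moved to the GPU, so peak GPU memory should be about one transformer block plus the calibration activations. This is not the actual t5_sequential code; quantize_block is just a placeholder for the per-layer GPTQ step, and the encoder/decoder attribute names assume the Hugging Face T5/UL2 layout.

```python
import torch

def quantize_block(block, calib_inputs):
    # Hypothetical placeholder for the per-block GPTQ step: run the calibration
    # batches through `block`, accumulate Hessian statistics, and quantize its
    # Linear weights in place. Not the repo's actual implementation.
    ...

def sequential_quantize(model, calib_inputs, dev="cuda"):
    # Keep the full fp16 model on the CPU; only the block currently being
    # quantized lives on the GPU.
    model.cpu()
    for stack in (model.encoder.block, model.decoder.block):  # T5/UL2 layout in HF Transformers
        for block in stack:
            block.to(dev)                 # one block's weights on the GPU
            quantize_block(block, calib_inputs)
            block.cpu()                   # offload before touching the next block
            torch.cuda.empty_cache()
    return model
```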

I was able to quantize it by using --nsamples 256 and modifying the part of t5_sequential that applies the final layer norm and dropout so that it runs on the CPU.
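
For anyone hitting the same OOM, the change was along these lines (a rough sketch using Hugging Face's T5Stack attribute names, not a verbatim diff of t5.py):

```python
def apply_final_norm_on_cpu(decoder_stack, hidden_states, dev="cuda"):
    # Run the final layer norm and dropout on the CPU instead of the GPU,
    # then move the result back. Attribute names (final_layer_norm, dropout)
    # follow Hugging Face's T5Stack; the actual lines changed in t5_sequential
    # may look different.
    final_layer_norm = decoder_stack.final_layer_norm.cpu()
    dropout = decoder_stack.dropout.cpu()

    hidden_states = final_layer_norm(hidden_states.cpu())
    hidden_states = dropout(hidden_states)
    return hidden_states.to(dev)
```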