qwopqwop200/GPTQ-for-LLaMa
4-bit quantization of LLaMA using GPTQ
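For context on what the repository does: GPTQ quantizes weights to low precision while compensating the resulting error using second-order (Hessian-based) information. As a minimal point of comparison, the simple round-to-nearest baseline that GPTQ improves upon can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the repository's implementation, and the function names are hypothetical:

```python
import numpy as np

def quantize_4bit(w, group_size=128):
    """Asymmetric 4-bit round-to-nearest quantization per group.
    A baseline sketch: GPTQ itself additionally propagates and
    compensates the quantization error across remaining weights."""
    w = w.reshape(-1, group_size)
    wmin = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - wmin) / 15.0  # 4 bits -> 16 levels
    scale = np.where(scale == 0, 1.0, scale)              # guard constant groups
    q = np.clip(np.round((w - wmin) / scale), 0, 15).astype(np.uint8)
    return q, scale, wmin

def dequantize_4bit(q, scale, wmin):
    """Reconstruct approximate float weights from 4-bit codes."""
    return q.astype(np.float32) * scale + wmin
```

Per-group `scale` and `wmin` must be stored alongside the 4-bit codes; the reconstruction error of each element is bounded by half a quantization step.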
Stargazers: 2983 · Watchers: 42 · Issues: 217 · Forks: 457
qwopqwop200/GPTQ-for-LLaMa Issues
- add support for minicpm (updated 3 months ago)
- GPTQ vs bitsandbytes (updated 6 months ago)
- Error when load GPTQ model (updated 7 months ago)
- Dependency conflicts for `safetensors` (closed 7 months ago, 1 comment)
- datasets.utils.info_utils.ExpectedMoreSplits: {'validation'} (updated 8 months ago, 1 comment)
- Syntax changed in triton.testing.do_bench() causing error when running llama_inference.py (updated 9 months ago)
- _pickle.UnpicklingError: invalid load key, 'v'. (updated 10 months ago, 1 comment)
- inference with the saved model error: AttributeError: module 'torch.backends.cuda' has no attribute 'sdp_kernel' (updated 10 months ago, 2 comments)
- Porting GPTQ to CPU? (updated 10 months ago, 2 comments)
- the inference speed of GPTQ 4bit quantized model (updated 10 months ago, 2 comments)
- Support Mistral. (updated a year ago)
- error: block with no terminator, has llvm.cond_br %5624, ^bb2, ^bb3 (updated a year ago)
- neox.py needs to add "import math" (updated a year ago)
- LoRa and diff with bitsandbytes (updated a year ago)
- Transformers broke again (AttributeError: 'GPTQ' object has no attribute 'inp1') (updated a year ago, 1 comment)
- Would GPTQ be able to support LLaMa2? (updated a year ago, 1 comment)
- Can i quantize HF version of llama model (updated a year ago)
- Why does the model quantization prompt KILLED at the end? (updated a year ago, 2 comments)
- Help: Quantized llama-7b model with custom prompt format produces only gibberish (updated a year ago, 1 comment)
- Proposed changes to reduce VRAM usage. Potentially quantize larger models on consumer hardware. (updated a year ago, 3 comments)
- Issue with GPTQ (updated a year ago, 1 comment)
- High PPL when groupsize != -1 for OPT model after replace linear layer with quantlinear. (updated a year ago, 1 comment)
- An error is reported when running python setup_cuda.py install (updated a year ago, 2 comments)
- can it support openllama model? (closed a year ago)
- Could not obtain official perplexity using bloom_eval() (updated a year ago)
- llama_inference 4bits error (updated a year ago)
- AttributeError: 'QuantLinear' object has no attribute 'weight' (t5 branch) (Google/flan-ul2) (closed a year ago, 2 comments)
- CUDA out of memory on flan-ul2 (closed a year ago, 1 comment)
- [Question] What is the expected discrepancy between simulated and actually computed values? (updated a year ago, 4 comments)
- The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.7) (updated a year ago, 2 comments)
- Sample code does not work (updated a year ago, 2 comments)
- SqueezeLLM support? (updated a year ago)
- What is the right perplexity number? (updated a year ago)
- Finetuning Quantized LLaMA (updated a year ago)
- compare with llama.cpp int4 quantize? (updated a year ago)
- How to quantize bloom after lora/ptuning? (updated a year ago)
- AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention' (closed a year ago, 2 comments)
- I use python llama.py to generate a quantized model, but I can't find the .safetensors model (closed a year ago, 1 comment)
- Wondering whether some of the triton or cuda kernel also speedup fp16 or not? (updated a year ago)
- Errors encountered when running benchmark FP16 baseline on multiple GPUs (updated a year ago, 2 comments)
- Does this work for gptj specifically the cuda branch? Thanks! (updated a year ago)
- Does not support 3bit quantization? (updated a year ago)
- No CUDA_ENV / conda-froce cudatoolkit-dev freezes (closed a year ago)
- Unable to run 'python setup_cuda.py install' (updated a year ago)
- Build issue with newer torch pybind11 cast.h - workaround inside (updated a year ago)
- 6-bit quantization (updated a year ago, 1 comment)
- no module named quant_cuda (fastest-inference-4bit branch) (updated a year ago, 1 comment)
- fastest-inference-4bit fails to build (closed a year ago, 3 comments)
- Giepeto (closed a year ago)
- Benchmark broken on H100 (updated a year ago)