qwopqwop200/GPTQ-for-LLaMa
4-bit quantization of LLaMA using GPTQ
Stargazers: 2932 · Watchers: 42 · Issues: 216 · Forks: 453
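Most of the issues listed below revolve around one operation: packing LLaMA weight matrices into 4-bit integers with per-group scales. The sketch that follows is not the repository's GPTQ algorithm, which uses second-order information to compensate for rounding error as each column is quantized; it is a minimal round-to-nearest illustration of the same 4-bit, group-wise storage format, and every name in it is hypothetical.

```python
import torch

def quantize_4bit_rtn(w: torch.Tensor, group_size: int = 128):
    """Toy round-to-nearest 4-bit quantization with per-group scales.

    Illustration only: GPTQ proper also updates the not-yet-quantized
    weights after each column to minimize the layer's output error.
    """
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 16 levels fit in 4 bits
    zero = torch.round(-w_min / scale)              # per-group zero point
    q = torch.clamp(torch.round(g / scale) + zero, 0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_4bit(q, scale, zero, shape):
    # Invert the affine mapping and restore the original matrix shape.
    return ((q.float() - zero) * scale).reshape(shape)

# Quantize a fake 4096x4096 layer and report mean reconstruction error.
w = torch.randn(4096, 4096)
q, scale, zero = quantize_4bit_rtn(w)
err = (w - dequantize_4bit(q, scale, zero, w.shape)).abs().mean()
print(f"mean |w - w_hat| = {err:.5f}")
```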
qwopqwop200/GPTQ-for-LLaMa Issues
running on old gpu with fp32 only · Updated a year ago · 3 comments
How to run inference with llama-65b-4bit on multi-GPU · Closed a year ago · 6 comments
Result with the branch `fastest-inference-4bit` · Closed a year ago · 11 comments
where to get /path/to/downloaded/llama/weights · Updated a year ago
About the granularity of weight quantization · Updated a year ago
OpenCL support · Updated a year ago · 1 comment
Errors compiling with CUDA 12.1 · Closed a year ago · 2 comments
Error on A100: device kernel image is invalid · Updated a year ago
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed. · Updated a year ago · 2 comments
CUDA kernel sync problem · Closed a year ago · 1 comment
wbit=16 Conversion Gives Error · Updated a year ago · 2 comments
CUDA Benchmark on 2bit, 3bit, 4bit models: why is 3bit slower than 4bit but faster than 2bit? · Closed a year ago · 1 comment
4bits on 65B · Closed a year ago · 1 comment
How can I get the gradient when using a 4-bit model? · Updated a year ago
IndexError: tensors used as indices must be long, byte or bool tensors · Updated a year ago · 2 comments
CUDA error: unknown error (error when quantizing the LLaMA model) · Updated a year ago · 1 comment
neox.py generates randrange() error · Closed a year ago · 13 comments
Security Issue: This Auto-downloads 800 trojan viruses · Closed a year ago · 2 comments
CUDA: 8bit quantized models are stupid. · Updated a year ago · 4 comments
File "<string>", line 21, in matmul_248_kernel · Updated a year ago
NameError: name 'transformers' is not defined · Closed a year ago · 2 comments
llama 30b generates strange answers after quantizing to 4bit · Closed a year ago · 1 comment
why disable tf32? · Closed a year ago · 4 comments
slower inference speed · Closed a year ago · 4 comments
Inference with Beam > 1 broken in Triton · Closed a year ago · 3 comments
I implemented an easy-to-use package based on the cuda branch · Closed a year ago · 3 comments
module 'quant_cuda' has no attribute 'vecquant4matmul' · Updated a year ago
Latest "change attention algorithm" commit breaks inference · Closed a year ago · 5 comments
Quantize 7b with 8GB VRAM OOM · Closed a year ago · 2 comments
triton branch is a lot slower than hipified cuda branch on AMD GPUs · Closed a year ago · 1 comment
Fused mlp causes assertion error · Updated a year ago · 5 comments
TypeError: expected string or bytes-like object · Updated a year ago · 2 comments
Compiled w/o GPU support. Am I missing something? · Closed a year ago · 3 comments
ERROR: Could not find a version that satisfies the requirement triton==2.0.0 (from versions: none) · Closed a year ago · 1 comment
Fixing Triton "Unexpected MMA layout version found" for pre-Volta GPUs raises new problems · Updated 10 months ago · 5 comments
make into a package (like sterlind did) · Closed a year ago · 5 comments
llama.cpp ERROR · Closed a year ago · 1 comment
CUDA branch, multi GPU. "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!" · Closed a year ago · 4 comments
Issue on Multi-GPU on the cuda branch · Closed a year ago
Found another repo claiming they implemented GPTQ · Closed a year ago · 1 comment
What is the command to install Triton? · Closed a year ago · 1 comment
A 4-bit quantized model will generate self-questioning and self-answering content · Closed a year ago · 2 comments
8-bit quantization has ridiculous PPL and outputs nonsense · Closed a year ago · 3 comments
my error · Closed a year ago · 2 comments
ModuleNotFoundError: No module named 'llama_inference_offload' · Closed a year ago · 14 comments
Killed · Closed a year ago · 4 comments
Is there a way to separate the prompt from the generated answer? · Closed a year ago · 2 comments
Installation issue | WSL 2 · Closed a year ago · 4 comments
T5 Benchmark · Updated a year ago · 25 comments
"Token indices sequence length is longer than the specified maximum sequence length for this model" · Closed a year ago · 1 comment
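Several of the closed issues above (the cuda-branch multi-GPU reports and the "Expected all tensors to be on the same device" RuntimeError) describe the standard PyTorch device-mismatch symptom. The snippet below is not taken from this repository; it is a minimal reproduction and fix, assuming a machine with at least two CUDA devices.

```python
import torch
import torch.nn as nn

# Mirrors what happens when a model is split across GPUs but an
# activation is not moved between stages.
layer = nn.Linear(8, 8).to("cuda:1")
x = torch.randn(2, 8, device="cuda:0")

try:
    layer(x)  # raises: Expected all tensors to be on the same device ...
except RuntimeError as e:
    print(e)

# Fix: move the activation to the layer's device before the call,
# which is what pipeline-style multi-GPU loaders do between stages.
y = layer(x.to("cuda:1"))
print(y.device)  # cuda:1
```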