qwopqwop200/GPTQ-for-LLaMa
4-bit quantization of LLaMA using GPTQ
Stargazers: 2932 · Watchers: 42 · Issues: 216 · Forks: 453
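Most of the issues listed below revolve around one operation: packing LLaMA weight matrices into 4-bit integers with per-group scales. The sketch that follows is not the repository's GPTQ algorithm, which uses second-order information to compensate for rounding error as each column is quantized; it is a minimal round-to-nearest illustration of the same 4-bit, group-wise storage format, and every name in it is hypothetical.

```python
import torch

def quantize_4bit_rtn(w: torch.Tensor, group_size: int = 128):
    """Toy round-to-nearest 4-bit quantization with per-group scales.

    Illustration only: GPTQ proper also updates the not-yet-quantized
    weights after each column to minimize the layer's output error.
    """
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 16 levels fit in 4 bits
    zero = torch.round(-w_min / scale)              # per-group zero point
    q = torch.clamp(torch.round(g / scale) + zero, 0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_4bit(q, scale, zero, shape):
    # Invert the affine mapping and restore the original matrix shape.
    return ((q.float() - zero) * scale).reshape(shape)

# Quantize a fake 4096x4096 layer and report mean reconstruction error.
w = torch.randn(4096, 4096)
q, scale, zero = quantize_4bit_rtn(w)
err = (w - dequantize_4bit(q, scale, zero, w.shape)).abs().mean()
print(f"mean |w - w_hat| = {err:.5f}")
```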
qwopqwop200/GPTQ-for-LLaMa Issues
running on old gpu with fp32 only · Updated a year ago · 3 comments
How to run inference with llama-65b-4bit on multi-GPU · Closed a year ago · 6 comments
Result with the branch `fastest-inference-4bit` · Closed a year ago · 11 comments
where to get /path/to/downloaded/llama/weights · Updated a year ago
About the granularity of weight quantization · Updated a year ago
OpenCL support · Updated a year ago · 1 comment
Errors compiling with CUDA 12.1 · Closed a year ago · 2 comments
Error on A100: device kernel image is invalid · Updated a year ago
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed. · Updated a year ago · 2 comments
CUDA kernel sync problem · Closed a year ago · 1 comment
wbit=16 Conversion Gives Error · Updated a year ago · 2 comments
CUDA Benchmark on 2bit, 3bit, 4bit models: why is 3bit slower than 4bit but faster than 2bit? · Closed a year ago · 1 comment
4bits on 65B · Closed a year ago · 1 comment
How can I get the gradient when using a 4-bit model? · Updated a year ago
IndexError: tensors used as indices must be long, byte or bool tensors · Updated a year ago · 2 comments
CUDA error: unknown error (error when quantizing the LLaMA model) · Updated a year ago · 1 comment
neox.py generates randrange() error · Closed a year ago · 13 comments
Security Issue: This Auto-downloads 800 trojan viruses · Closed a year ago · 2 comments
CUDA: 8bit quantized models are stupid. · Updated a year ago · 4 comments
File "<string>", line 21, in matmul_248_kernel · Updated a year ago
NameError: name 'transformers' is not defined · Closed a year ago · 2 comments
llama 30b generates strange answers after quantizing to 4bit · Closed a year ago · 1 comment
why disable tf32? · Closed a year ago · 4 comments
slower inference speed · Closed a year ago · 4 comments
Inference with Beam > 1 broken in Triton · Closed a year ago · 3 comments
I implemented an easy-to-use package based on the cuda branch · Closed a year ago · 3 comments
module 'quant_cuda' has no attribute 'vecquant4matmul' · Updated a year ago
Latest "change attention algorithm" commit breaks inference · Closed a year ago · 5 comments
Quantize 7b with 8GB VRAM OOM · Closed a year ago · 2 comments
triton branch is a lot slower than hipified cuda branch on AMD GPUs · Closed a year ago · 1 comment
Fused mlp causes assertion error · Updated a year ago · 5 comments
TypeError: expected string or bytes-like object · Updated a year ago · 2 comments
Compiled w/o GPU support. Am I missing something? · Closed a year ago · 3 comments
ERROR: Could not find a version that satisfies the requirement triton==2.0.0 (from versions: none) · Closed a year ago · 1 comment
Fixing Triton "Unexpected MMA layout version found" for pre-Volta GPUs raises new problems · Updated 10 months ago · 5 comments
make into a package (like sterlind did) · Closed a year ago · 5 comments
llama.cpp ERROR · Closed a year ago · 1 comment
CUDA branch, multi GPU. "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!" · Closed a year ago · 4 comments
Issue on Multi-GPU on the cuda branch · Closed a year ago
Found another repo claiming they implemented GPTQ · Closed a year ago · 1 comment
What is the command to install Triton? · Closed a year ago · 1 comment
A 4-bit quantized model will generate self-questioning and self-answering content · Closed a year ago · 2 comments
8-bit quantization has ridiculous PPL and outputs nonsense · Closed a year ago · 3 comments
my error · Closed a year ago · 2 comments
ModuleNotFoundError: No module named 'llama_inference_offload' · Closed a year ago · 14 comments
Killed · Closed a year ago · 4 comments
Is there a way to separate the prompt from the generated answer? · Closed a year ago · 2 comments
Installation issue | WSL 2 · Closed a year ago · 4 comments
T5 Benchmark · Updated a year ago · 25 comments
"Token indices sequence length is longer than the specified maximum sequence length for this model" · Closed a year ago · 1 comment
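Several of the closed issues above (the cuda-branch multi-GPU reports and the "Expected all tensors to be on the same device" RuntimeError) describe the standard PyTorch device-mismatch symptom. The snippet below is not taken from this repository; it is a minimal reproduction and fix, assuming a machine with at least two CUDA devices.

```python
import torch
import torch.nn as nn

# Mirrors what happens when a model is split across GPUs but an
# activation is not moved between stages.
layer = nn.Linear(8, 8).to("cuda:1")
x = torch.randn(2, 8, device="cuda:0")

try:
    layer(x)  # raises: Expected all tensors to be on the same device ...
except RuntimeError as e:
    print(e)

# Fix: move the activation to the layer's device before the call,
# which is what pipeline-style multi-GPU loaders do between stages.
y = layer(x.to("cuda:1"))
print(y.device)  # cuda:1
```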