turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

turboderp/exllamav2 Issues

[BUG] Quantization of Qwen return garbage
Updated 6 days ago · 8 comments
Curious about Exllama+TP
Updated 7 days ago · 10 comments
How to implement paged attention in HF format?
Updated 8 days ago · 5 comments
[REQUEST] need exllamav2-0.2.1+cu121.torch2.4.0-cp310-cp310-win_amd64.whl
Updated 10 days ago
## Measurement/inference error (3): hidden_states
Updated 10 days ago · 6 comments
Error in quant
Updated 11 days ago · 2 comments
[BUG] 0.2.1 doesn't compile on Opensuse
Closed 12 days ago · 6 comments
Severe model degradation observed when upgrading from v0.1.8 to v0.2.0
Closed 12 days ago · 4 comments
linear growth in system RAM during load change in v0.1.8 to v0.2
Closed 13 days ago · 1 comment
Batch generation with Exllamav2_HF is weird
Closed 15 days ago · 7 comments
Command R+ is broken?
Closed 15 days ago · 16 comments
Want to try row split + all_reduce for MLP and attn
Updated 17 days ago · 4 comments
A doubt regarding filters/tools.
Closed 18 days ago · 3 comments
how can i solve this problem
Updated 19 days ago
Pipeline mode support
Closed 19 days ago · 2 comments
Exllama v2 crashes when starting to load model in the third gpu
Closed 20 days ago · 22 comments
lollms exllamav2 binding module not found
Updated 20 days ago · 1 comment
Remove tokens and system prompt from generation
Closed 22 days ago · 1 comment
Got error while running qwen72b_4.25 using inference_tp.py
Closed 22 days ago · 1 comment
Tensor parallelism issues
Updated 22 days ago · 5 comments
Does NVLink improve tensor parallelism?
Updated 22 days ago · 1 comment
Async Stream Generator?
Updated 23 days ago · 3 comments
MemoryError despite sufficient system resources
Closed 23 days ago · 2 comments
Do you know of any code framework that supports fast attention score calculation similar to flash attention?
Updated a month ago
Request for multi model support
Updated a month ago
ModuleNotFoundError: No module named 'blessed'
Closed a month ago · 1 comment
[question] Wrapper Linear API and 2bits
Updated a month ago · 4 comments
problem with cache.
Closed a month ago · 15 comments
[ERROR] Worker (pid:25134) was sent SIGKILL! Perhaps out of memory?
Updated a month ago · 12 comments
Llama 3 speed
Updated a month ago · 2 comments
Will it support CPU offloading?
Updated a month ago · 5 comments
graphrag can't index using mistral large 123B with exllamav2
Updated a month ago · 1 comment
Q8 or unquantized cache with what context length for llama 3.1-8b 5.0 bpw exl2?
Updated 2 months ago · 8 comments
name 'flash_attn_func' is not defined
Closed 2 months ago · 1 comment
Add more docs and type annotations
Updated 2 months ago
orig_func Quantization error
Updated 2 months ago · 3 comments
Quantizing Llama 3.1 405B
Closed 2 months ago · 39 comments
Unable to load gptq models
Closed 2 months ago · 2 comments
Triton Support
Updated 2 months ago · 1 comment
Enhancement: Docker Image Github Actions
Closed 2 months ago · 1 comment
Multicache test failed
Closed 2 months ago · 1 comment
No prebuilt pip package for version 0.1.8
Closed 2 months ago · 1 comment
Got error in new model LLama 3.1 : Value for eos_token_id is not of expected type <class 'int'>
Closed 2 months ago · 3 comments
Problem with async-generator
Closed 2 months ago · 6 comments
Speculative Decoding not working with ExLlamaV2DynamicGeneratorAsync
Closed 2 months ago · 1 comment
Streaming Issue with ExLlamaV2DynamicJobAsync
Closed 2 months ago · 3 comments
I am trying to add DeepSeekV2Moe model
Updated 2 months ago · 8 comments
Can you share some formulas about GPTQ dequant?
Closed 2 months ago · 6 comments
Manual model merges
Updated 2 months ago · 2 comments
Prefill not done
Closed 2 months ago · 2 comments