neuralmagic/nm-vllm Issues

- [Usage]: (closed)
- [Feature]: Support LLama3 (closed)
- How to get the sparsed model? (closed)
- Sparsity benchmarks (closed)
- [Doc]: Support Mixtral? (closed)
A high-throughput and memory-efficient inference and serving engine for LLMs