IST-DASLab / marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

IST-DASLab/marlin Issues

performance
Updated a month ago
Server, TGI and/or vLLM Support
Closed a month ago · 7
can't build marlin
Closed a month ago · 1
Issues to generate tokens after "get_llama_marlin"
Updated 2 months ago
a_sh_rd_delta_o
Updated 2 months ago
questions about slice_col_par
Updated 2 months ago · 2
Marlin slower than fp16 on larger batches
Updated 2 months ago · 2
Questions about matrix A's layout in shared memory.
Updated 2 months ago
Does Marlin support zero-point quantization?
Updated 2 months ago · 7
[QST] Weight Format & GEMM
Updated 2 months ago · 2
Support for Hopper H100
Updated 2 months ago · 3
[Bug] H800 run UT failed.
Updated 2 months ago · 3
groupsize=64 is not supported
Updated 3 months ago
Do you have any plans to support MoE GEMM?
Updated 3 months ago
can this support lower bit quant?
Updated 3 months ago · 3
Small typo in the shape description
Updated 3 months ago
Open: optimize for GEMM regime
Closed 4 months ago · 7
Packing order (`_perm` and `_scale_perm`)
Closed 4 months ago · 5
Where does the code use "immediate eviction" and "fetched from L2 cache"?
Updated 4 months ago · 2
Turing support
Updated 4 months ago · 1