intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library


Add support for BitNet b1.58 quantization

vegax87 opened this issue · comments

Is your feature request related to a problem? Please describe.
Currently, 8-bit and 4-bit quantization are the de facto standards, but I would like an implementation of the BitNet b1.58 algorithm, which improves training and inference speed while maintaining FP16-level accuracy by rounding every weight to one of the ternary values (-1, 0, +1).

Describe the solution you'd like
Add BitNet b1.58 quantization to the library.

Describe alternatives you've considered
There are no alternatives as far as I know; it is a novel quantization algorithm.

Additional context
Original paper: https://arxiv.org/pdf/2402.17764.pdf
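
For reference, a minimal sketch of the absmean ternary quantizer the paper describes (my own illustration, not the authors' code):

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """BitNet b1.58 weight quantization: scale by the mean absolute value,
    then round and clip every weight to {-1, 0, +1}.

    Returns the ternary tensor and the scale, so w is approximated by
    w_q * scale.
    """
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale
```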

I agree. I expect this to have a big impact: LLM generation is bandwidth bound, so smaller weights will translate into better performance. This feature requires driver updates to be implemented; I'll update this ticket once a compatible driver is available.
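
To make that concrete, a quick back-of-the-envelope calculation (illustrative numbers, not measurements) of the weight footprint that has to be streamed from DRAM:

```python
# Weight footprint of a 7B-parameter model at different precisions. During
# decoding every weight is read once per generated token, so the achievable
# tokens/s is roughly DRAM bandwidth divided by this footprint.
params = 7e9
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("b1.58", 1.58)]:
    print(f"{name:>6}: {params * bits / 8 / 1e9:.2f} GB")
```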

I tried running Mistral on the NPU (155H) versus running it in Ollama, and the Ollama version performs better than the NPU version. I think this is because the quantized model is smaller, so it can read memory faster. Supporting quantization would be the better choice.

I agree, quantization support is really important for performance, mostly because decoding is DRAM bandwidth bound, so smaller weights => less data transfer => better performance (https://intel.github.io/intel-npu-acceleration-library/llm_performance.html).
We are currently doing driver work to properly support mixed-precision inference on the NPU; it should come in the next driver releases. Stay tuned ;)

Microsoft has published an updated paper with a basic implementation of BitNet b1.58 in PyTorch:

https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
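
For anyone who wants to experiment before NPU support lands, a rough PyTorch sketch of the BitLinear layer that PDF describes might look like this (simplified and inference-only; the paper's full recipe also applies RMSNorm before activation quantization and uses a straight-through estimator during training):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Inference-only sketch of BitNet b1.58's BitLinear layer.

    Weights are quantized to {-1, 0, +1} with a per-tensor absmean scale;
    activations are quantized per token to int8 with an absmax scale.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        eps = 1e-5
        # Ternary weight quantization (absmean rounding-and-clipping).
        w_scale = self.weight.abs().mean().clamp(min=eps)
        w_q = (self.weight / w_scale).round().clamp(-1, 1)
        # Symmetric 8-bit per-token activation quantization (absmax).
        x_scale = x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps) / 127.0
        x_q = (x / x_scale).round().clamp(-128, 127)
        # Compute in the quantized domain, then fold both scales back in.
        y = F.linear(x_q, w_q) * (x_scale * w_scale)
        return y + self.bias if self.bias is not None else y

# Example: drop-in replacement for a dense layer.
layer = BitLinear(1024, 1024, bias=False)
out = layer(torch.randn(2, 16, 1024))
```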

UPDATE: There's another very interesting article that combines 1-bit/2-bit quantization with Half-Quadratic Quantization (HQQ):

https://mobiusml.github.io/1bit_blog/
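
For context, HQQ keeps the standard group-wise affine quantizer and optimizes its parameters (the zero-point in particular) against a robust reconstruction error. A hedged sketch of just the underlying 2-bit quantize/dequantize mechanics, with the half-quadratic optimization itself omitted:

```python
import torch

def affine_quant_2bit(w: torch.Tensor, group_size: int = 64):
    """Group-wise 2-bit affine quantization. Each group of weights shares
    one scale and zero-point; HQQ's contribution is tuning these against a
    sparsity-promoting error norm, which this sketch does not implement."""
    g = w.reshape(-1, group_size)
    w_min = g.min(dim=1, keepdim=True).values
    w_max = g.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 3.0  # 2 bits -> levels 0..3
    zero = (-w_min / scale).round()
    w_q = (g / scale + zero).round().clamp(0, 3)
    return w_q, scale, zero

def affine_dequant(w_q, scale, zero, shape):
    # Reconstruct approximate weights from the quantized values.
    return ((w_q - zero) * scale).reshape(shape)
```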