intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library


Add support for BitNet b1.58 quantization

vegax87 opened this issue · comments

Is your feature request related to a problem? Please describe.
Currently, 8-bit and 4-bit quantization are the de facto standards, but I would like an implementation of the BitNet b1.58 algorithm, which improves training and inference speed while maintaining FP16-level accuracy by rounding every weight to one of the ternary values (-1, 0, +1).

Describe the solution you'd like
Add BitNet b1.58 quantization to the library.

Describe alternatives you've considered
There are no alternatives as far as I know; it is a novel quantization algorithm.

Additional context
Original paper: https://arxiv.org/pdf/2402.17764.pdf
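
For reference, a minimal sketch of the absmean ternary quantizer the paper describes (my own illustration, not the authors' code):

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """BitNet b1.58 weight quantization: scale by the mean absolute value,
    then round and clip every weight to {-1, 0, +1}.

    Returns the ternary tensor and the scale, so w is approximated by
    w_q * scale.
    """
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale
```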

I agree. I expect this to have a big impact: LLM generation is bandwidth bound, so smaller weights will translate into better performance. This feature requires driver updates to be implemented; I'll update this ticket once a compatible driver is available.
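
To make that concrete, a quick back-of-the-envelope calculation (illustrative numbers, not measurements) of the weight footprint that has to be streamed from DRAM:

```python
# Weight footprint of a 7B-parameter model at different precisions. During
# decoding every weight is read once per generated token, so the achievable
# tokens/s is roughly DRAM bandwidth divided by this footprint.
params = 7e9
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("b1.58", 1.58)]:
    print(f"{name:>6}: {params * bits / 8 / 1e9:.2f} GB")
```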

I tried running Mistral on the NPU (155H) versus running it in Ollama, and the Ollama version performs better than the NPU version. I think this is because the quantized model is smaller, so it can read memory faster. Supporting quantization would be the better choice.

I agree, quantization support is really important for performance, mostly because decoding is DRAM bandwidth bound, so smaller weights => less data transfer => better performance (https://intel.github.io/intel-npu-acceleration-library/llm_performance.html).
We are currently doing driver work to properly support mixed-precision inference on the NPU; it should come in the next driver releases. Stay tuned ;)

Microsoft has published an updated paper with a basic implementation of BitNet b1.58 in PyTorch:

https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
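
For anyone who wants to experiment before NPU support lands, a rough PyTorch sketch of the BitLinear layer that PDF describes might look like this (simplified and inference-only; the paper's full recipe also applies RMSNorm before activation quantization and uses a straight-through estimator during training):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Inference-only sketch of BitNet b1.58's BitLinear layer.

    Weights are quantized to {-1, 0, +1} with a per-tensor absmean scale;
    activations are quantized per token to int8 with an absmax scale.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        eps = 1e-5
        # Ternary weight quantization (absmean rounding-and-clipping).
        w_scale = self.weight.abs().mean().clamp(min=eps)
        w_q = (self.weight / w_scale).round().clamp(-1, 1)
        # Symmetric 8-bit per-token activation quantization (absmax).
        x_scale = x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps) / 127.0
        x_q = (x / x_scale).round().clamp(-128, 127)
        # Compute in the quantized domain, then fold both scales back in.
        y = F.linear(x_q, w_q) * (x_scale * w_scale)
        return y + self.bias if self.bias is not None else y

# Example: drop-in replacement for a dense layer.
layer = BitLinear(1024, 1024, bias=False)
out = layer(torch.randn(2, 16, 1024))
```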

UPDATE: There's another very interesting article that combines 1-bit/2-bit quantization with Half-Quadratic Quantization (HQQ):

https://mobiusml.github.io/1bit_blog/
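
For context, HQQ keeps the standard group-wise affine quantizer and optimizes its parameters (the zero-point in particular) against a robust reconstruction error. A hedged sketch of just the underlying 2-bit quantize/dequantize mechanics, with the half-quadratic optimization itself omitted:

```python
import torch

def affine_quant_2bit(w: torch.Tensor, group_size: int = 64):
    """Group-wise 2-bit affine quantization. Each group of weights shares
    one scale and zero-point; HQQ's contribution is tuning these against a
    sparsity-promoting error norm, which this sketch does not implement."""
    g = w.reshape(-1, group_size)
    w_min = g.min(dim=1, keepdim=True).values
    w_max = g.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 3.0  # 2 bits -> levels 0..3
    zero = (-w_min / scale).round()
    w_q = (g / scale + zero).round().clamp(0, 3)
    return w_q, scale, zero

def affine_dequant(w_q, scale, zero, shape):
    # Reconstruct approximate weights from the quantized values.
    return ((w_q - zero) * scale).reshape(shape)
```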