kyegomez / BitNet

Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in pytorch

Home Page: https://discord.gg/qUtxnK2NMf


BitNet model performs worse than Base Transformer

johanssontan opened this issue · comments

I used train.py to train both the BitNet model and the base Transformer model and compared them. I found that BitNet consumes more time and memory while achieving lower loss compared to the base model, which is not consistent with what the BitNet paper claims. What could be the reason for this?

Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
Fund with Polar

Not sure if it is a bug...

In my understanding of BitNet, training will cost more space because both the 1-bit weights and the full-precision latent weights (fp16 if you use mixed-precision training, otherwise fp32) are kept in memory, and the optimizer states are kept in full precision as well. The model only becomes slim and fast at inference time. 🤔
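
To illustrate what I mean, here is a minimal sketch of a quantization-aware 1-bit linear layer (a hypothetical `NaiveBitLinear`, not the repository's actual `BitLinear`, and omitting activation quantization and scaling). The full-precision latent weight stays resident for the optimizer, and a binarized copy plus straight-through-estimator temporaries are created on every forward pass, which is why training costs more memory and time than a plain `nn.Linear`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveBitLinear(nn.Module):
    """Hypothetical sketch of a 1-bit linear layer (not the repo's BitLinear).

    During training the full-precision latent weight is what the optimizer
    updates; it is binarized on the fly each forward pass, so peak memory is
    higher than for a plain nn.Linear. Only at inference could the latent
    weight be dropped in favor of packed 1-bit weights.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Full-precision latent weight, kept for the whole training run.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Center and binarize; the straight-through estimator makes the
        # forward pass use sign(w - mean) while gradients flow to w unchanged.
        w_centered = w - w.mean()
        w_bin = torch.sign(w_centered)
        w_quant = w_centered + (w_bin - w_centered).detach()
        # These extra temporaries are part of why training is slower and
        # uses more memory than the fp16/fp32 baseline layer.
        return F.linear(x, w_quant)


# During training, both self.weight (full precision) and its binarized copy
# exist; for deployment you would export only the 1-bit weights.
layer = NaiveBitLinear(512, 512)
y = layer(torch.randn(4, 512))
```

So seeing higher training cost than the baseline is expected; the paper's efficiency claims are about inference with packed low-bit weights, not about training.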