kyegomez / BitNet

Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in pytorch

Home Page: https://discord.gg/qUtxnK2NMf


BitNet model performs worse than Base Transformer

johanssontan opened this issue · comments

I used train.py to train both the BitNet model and the base Transformer model and compared them. I found that BitNet consumes more time and memory while achieving lower loss compared to the base model, which is not consistent with what the BitNet paper claims. What could be the reason for this?

Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
Fund with Polar

Not sure if it is a bug...

In my understanding of BitNet, training will cost more space because both the 1-bit weights and the full-precision latent weights (fp16 if you use mixed-precision training, otherwise fp32) are kept in memory, and the optimizer states are kept in full precision as well. The model only becomes slim and fast at inference time. 🤔
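
To illustrate what I mean, here is a minimal sketch of a quantization-aware 1-bit linear layer (a hypothetical `NaiveBitLinear`, not the repository's actual `BitLinear`, and omitting activation quantization and scaling). The full-precision latent weight stays resident for the optimizer, and a binarized copy plus straight-through-estimator temporaries are created on every forward pass, which is why training costs more memory and time than a plain `nn.Linear`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveBitLinear(nn.Module):
    """Hypothetical sketch of a 1-bit linear layer (not the repo's BitLinear).

    During training the full-precision latent weight is what the optimizer
    updates; it is binarized on the fly each forward pass, so peak memory is
    higher than for a plain nn.Linear. Only at inference could the latent
    weight be dropped in favor of packed 1-bit weights.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Full-precision latent weight, kept for the whole training run.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Center and binarize; the straight-through estimator makes the
        # forward pass use sign(w - mean) while gradients flow to w unchanged.
        w_centered = w - w.mean()
        w_bin = torch.sign(w_centered)
        w_quant = w_centered + (w_bin - w_centered).detach()
        # These extra temporaries are part of why training is slower and
        # uses more memory than the fp16/fp32 baseline layer.
        return F.linear(x, w_quant)


# During training, both self.weight (full precision) and its binarized copy
# exist; for deployment you would export only the 1-bit weights.
layer = NaiveBitLinear(512, 512)
y = layer(torch.randn(4, 512))
```

So seeing higher training cost than the baseline is expected; the paper's efficiency claims are about inference with packed low-bit weights, not about training.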