Mixture of Depths

An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"

Setup

  • First, follow the instructions for setting up your environment for Llama 2 here.
  • Then:
pip install einops

Details

  • Implements MoD in Llama 2

  • Follows the paper's configuration with some assumptions:

    • Routes every other layer (see the routing sketch after this list)
    • Training configurations for both of the proposed causal-inference methods
  • Notes on the auxiliary router for causal inference (second sketch after this list):

    • Currently, we train it separately, after the MoD Llama model has been trained.
    • This is a simple task: we reach high token-prediction accuracy quickly, and using a simple dataset simplifies it further.
  • MoD_training.ipynb demonstrates training and was used for the results below.

  • MoD_sampling.ipynb demonstrates generation with each method.
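
The per-block routing can be illustrated roughly as follows. This is a minimal sketch of the paper's top-k routing mechanism, not this repository's actual code: the names MoDBlock and capacity_ratio, and the assumption that the wrapped block returns its sub-layer output without the outer residual, are all hypothetical.

import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Wraps a transformer block with Mixture-of-Depths top-k routing (sketch)."""

    def __init__(self, block: nn.Module, dim: int, capacity_ratio: float = 0.125):
        super().__init__()
        self.block = block                            # wrapped attention + MLP sub-layers (no outer residual)
        self.router = nn.Linear(dim, 1, bias=False)   # per-token scalar routing weight
        self.capacity_ratio = capacity_ratio          # fraction of the sequence the block processes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, dim = x.shape
        k = max(1, int(self.capacity_ratio * seq_len))

        weights = self.router(x).squeeze(-1)          # (bsz, seq_len)
        topk_weights, topk_idx = weights.topk(k, dim=-1)
        topk_idx, order = topk_idx.sort(dim=-1)       # keep selected tokens in sequence order
        topk_weights = topk_weights.gather(-1, order)

        gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, dim)
        selected = x.gather(1, gather_idx)            # (bsz, k, dim)

        # Positional-embedding and attention-mask bookkeeping for the shortened
        # sequence is omitted from this sketch.
        processed = self.block(selected) * topk_weights.unsqueeze(-1)

        # Selected tokens receive x + weight * block(x); all other tokens pass
        # through unchanged on the residual stream.
        return x.scatter_add(1, gather_idx, processed)

In the every-other-layer configuration above, a wrapper like this would be applied to alternating Llama layers, with the remaining layers left as standard transformer blocks.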
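
The auxiliary ("second") router for causal inference can be sketched similarly: a small classifier trained to predict, from the current token alone, whether the non-causal top-k router would select it. The paper's other proposal, the auxiliary loss, applies the same kind of binary cross-entropy term to the main router's outputs during MoD training instead. The names AuxiliaryRouter and causal_router_loss below are hypothetical; this is not this repository's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryRouter(nn.Module):
    """Small classifier that causally predicts whether a token will be routed (sketch)."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x).squeeze(-1)        # per-token logit: "would the top-k router pick this token?"

def causal_router_loss(aux_logits: torch.Tensor, topk_idx: torch.Tensor) -> torch.Tensor:
    """BCE between the auxiliary router's causal predictions and the main router's top-k choices."""
    targets = torch.zeros_like(aux_logits)
    targets.scatter_(1, topk_idx, 1.0)        # 1 where the (non-causal) top-k router selected the token
    return F.binary_cross_entropy_with_logits(aux_logits, targets)

As noted above, this auxiliary router is trained separately after the MoD model. At generation time, each new token is routed through a block only when its predicted probability crosses a threshold (e.g. 0.5), so the routing decision never depends on future tokens.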

Results

  • 50 million parameter model
    • C4
      • Baseline after 1 epoch:
        • Loss: 3.73
        • Samples/sec: 6.79
      • MoD w/ Auxiliary Loss after 1 epoch:
        • Loss: 3.81
        • Samples/sec: 8.15
      • MoD w/ Auxiliary Router after 1 epoch:
        • Loss: 4.19
        • Samples/sec: 7.64
    • Tiny Stories
      • Baseline after 5 epochs:
        • Loss: 2.46
        • Samples/sec: 11.22
      • MoD w/ Auxiliary Loss after 5 epochs:
        • Loss: 2.55
        • Samples/sec: 11.33
      • MoD w/ Auxiliary Router after 5 epochs:
        • Loss: 2.48
        • Auxiliary Router Causal Loss: 0.15
        • Samples/sec: 11.54

TODO

  • Validate sampling methods
    • Auxiliary loss
    • "Second" router

Citations

@misc{raposo2024mixtureofdepths,
    title={Mixture-of-Depths: Dynamically allocating compute in transformer-based language models}, 
    author={David Raposo and Sam Ritter and Blake Richards and Timothy Lillicrap and Peter Conway Humphreys and Adam Santoro},
    year={2024},
    eprint={2404.02258},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
