scar-ai / The-Latentformer

A Mixture of Experts model with latent attention designed for efficient training and inference.

Repository from GitHub: https://github.com/scar-ai/The-Latentformer

Advancing MoE with Latentformer


This repository contains the complete implementation of a sophisticated Transformer-based language model, featuring a unique Multi Latent Attention (MLA) mechanism and a Mixture-of-Experts (MoE) feed-forward layer. The model is designed for high-performance text generation and is built to scale efficiently using distributed training.

A regular Transformer version of this model (single FFN, no routing), available on the "Old" branch of this repo, beat GPT-2 Large's test perplexity on WikiText-2 after 2h36m of training on a node of 8 AMD MI300X GPUs, with ~300M parameters.

This project provides the full codebase, from the architectural backbone and data processing pipelines to single-GPU and distributed training scripts, and a ready-to-use interactive Streamlit application for inference.

✨ Key Features

  • Multi Latent Attention (MLA): A novel attention mechanism, introduced in the DeepSeek-V2 paper and also used in DeepSeek-V3, that splits the query and key projections into two paths: a content-based path and a rotary (positional) path. This lets the model process and weigh contextual and positional information separately, leading to more nuanced text generation; a minimal sketch follows this list.
  • Mixture-of-Experts (MoE) Layers: The feed-forward network in each Transformer block is replaced with a sparse MoE layer. This lets the model carry a very high total parameter count while activating only a small subset of expert networks for each token, keeping training and inference efficient. The router architecture was inspired by the Hugging Face blog post on MoEs.
  • Rotary Position Embeddings (RoPE): Implements rotary relative position embeddings, applied inside the rotary path of the MLA mechanism.
  • Distributed Training Ready: Includes a script (main_distributed.py) that leverages PyTorch's DistributedDataParallel (DDP) for robust and scalable multi-GPU training (tested on a node of 8 AMD MI300X).
  • Custom Data Pipeline: A dedicated data loader (OpenWebText.py) for processing the OpenWebText dataset, including on-the-fly tokenization, cleaning, and batching; a pipeline sketch appears after this list.
  • Interactive Demo: A user-friendly Streamlit application (user.py) to interact with the trained model, featuring real-time text generation and adjustable sampling parameters.
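
To make the content/rotary split concrete, here is a minimal single-head sketch of attention whose logits sum a content path and a RoPE path, in the spirit of MLA. The function names, tensor shapes, and the rotate-half RoPE formulation are illustrative assumptions; MultiHeadAttention in model.py is the actual implementation.

    # Minimal sketch of decoupled content/rotary attention (single head, batch-first).
    # Names and shapes are illustrative assumptions, not the repo's exact API.
    import torch
    import torch.nn.functional as F


    def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
        """Apply rotary position embeddings to x of shape (batch, seq, dim)."""
        b, t, d = x.shape
        half = d // 2
        freqs = 1.0 / (base ** (torch.arange(0, half, device=x.device) / half))
        angles = torch.arange(t, device=x.device)[:, None] * freqs[None, :]  # (t, half)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


    def decoupled_attention(q_c, k_c, q_r, k_r, v):
        """Attention whose logits sum a content path and a rotary (positional) path."""
        q_r, k_r = apply_rope(q_r), apply_rope(k_r)              # only this path is rotated
        scale = (q_c.size(-1) + q_r.size(-1)) ** -0.5
        logits = (q_c @ k_c.transpose(-2, -1) + q_r @ k_r.transpose(-2, -1)) * scale
        mask = torch.triu(torch.ones_like(logits, dtype=torch.bool), diagonal=1)
        logits = logits.masked_fill(mask, float("-inf"))          # causal mask
        return F.softmax(logits, dim=-1) @ v

The key point is that only the q_r/k_r path is rotated, so positional and contextual evidence enter the attention logits as separate terms.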
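
Likewise, the data pipeline can be pictured as a streaming tokenize-and-pack loop. The Hugging Face datasets/tokenizer choices and the helper below are assumptions for illustration only; OpenWebText.py defines the repo's actual cleaning and batching.

    # Sketch of an on-the-fly tokenization + batching pipeline for OpenWebText.
    # Dataset id, tokenizer, and function names are assumptions, not the repo's code.
    import torch
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    stream = load_dataset("Skylion007/openwebtext", split="train", streaming=True)


    def batches(block_size: int = 1024, batch_size: int = 8):
        """Yield (input, target) token blocks, tokenizing documents as they stream in."""
        buffer = []
        for doc in stream:
            text = doc["text"].strip()                      # minimal cleaning
            if not text:
                continue
            buffer.extend(tokenizer(text)["input_ids"] + [tokenizer.eos_token_id])
            while len(buffer) >= block_size * batch_size + 1:
                chunk = torch.tensor(buffer[: block_size * batch_size + 1])
                buffer = buffer[block_size * batch_size:]
                x = chunk[:-1].view(batch_size, block_size)  # inputs
                y = chunk[1:].view(batch_size, block_size)   # next-token targets
                yield x, y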

📂 Repository Structure & File Guide

This repository is organized to provide a clear path from understanding the model's architecture to training it and finally using it for inference.

1. Model Architecture

  • model.py: This is the heart of the project. It defines the complete model architecture, including:
    • TheTransformer: The main class that assembles the entire model.
    • MultiHeadAttention: The custom Multi Latent Attention implementation.
    • GatingNetwork & TransformerBlock: The core components of the Mixture-of-Experts (MoE) layers; a gating sketch follows this section.
    • RotaryPositionEncoding: The implementation for RoPE.
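
As a rough picture of the gating logic, the sketch below routes each token to its top-k experts and mixes their outputs. The expert count, top-k value, and module names are assumptions rather than the repo's exact GatingNetwork / TransformerBlock code.

    # Minimal top-k MoE gating sketch; hyperparameters and names are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class TinyMoE(nn.Module):
        def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)      # gating network
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            )
            self.top_k = top_k

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            tokens = x.view(-1, x.size(-1))                          # (n_tokens, d_model)
            gate = F.softmax(self.router(tokens), dim=-1)            # routing probabilities
            weights, chosen = gate.topk(self.top_k, dim=-1)          # top-k experts per token
            weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalise the mix
            out = torch.zeros_like(tokens)
            for e, expert in enumerate(self.experts):
                mask = chosen == e                                   # tokens routed to expert e
                if mask.any():
                    idx, slot = mask.nonzero(as_tuple=True)
                    out[idx] += weights[idx, slot, None] * expert(tokens[idx])
                # only selected experts ever run, so compute stays sparse
            return out.view_as(x)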

2. Training the Model

The repository includes two scripts for training the model, catering to different hardware setups.

  • training.py (Single-GPU Training)

    • Purpose: A straightforward script for training the model on a single GPU.
    • Details: It handles data loading, model initialization, a standard training loop with mixed-precision support (torch.amp), and a custom learning rate scheduler; a minimal mixed-precision step is sketched after this list.
    • Use Case: Ideal for debugging, running smaller-scale experiments, or for users who do not have a multi-GPU environment.
  • main_distributed.py (Multi-GPU Distributed Training)

    • Purpose: The primary script for training the full-scale model efficiently across multiple GPUs.
    • Details: It leverages PyTorch's DistributedDataParallel (DDP) and DistributedSampler to parallelize training across GPUs. It also includes an optional token-dropping feature as a regularization technique; the DDP scaffolding is sketched after this list.
    • Use Case: The recommended script for training the model from scratch to achieve the best performance on large datasets.
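
For orientation, a single mixed-precision training step with torch.amp might look like the sketch below. The model, batch format, optimizer, and scheduler are placeholders, not the exact objects built in training.py, and a CUDA/ROCm device is assumed.

    # Sketch of one mixed-precision training step with torch.amp (placeholder objects).
    import torch
    import torch.nn.functional as F

    scaler = torch.amp.GradScaler("cuda")   # assumes a CUDA/ROCm device


    def train_step(model, batch, optimizer, scheduler):
        inputs, targets = (t.to("cuda") for t in batch)
        optimizer.zero_grad(set_to_none=True)
        with torch.amp.autocast(device_type="cuda"):
            logits = model(inputs)                              # (batch, seq, vocab)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        scaler.scale(loss).backward()    # scale loss so low-precision grads don't underflow
        scaler.step(optimizer)           # unscale and apply the optimizer update
        scaler.update()
        scheduler.step()                 # custom LR schedule advances per step
        return loss.item()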
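
And the DDP scaffolding around such a step, assuming a torchrun launch with one process per GPU (e.g. torchrun --nproc_per_node=8 main_distributed.py), could be sketched as follows; the actual script may organize this differently.

    # Sketch of DDP setup with a DistributedSampler, assuming a torchrun launch.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler


    def setup_ddp(model, dataset, batch_size):
        dist.init_process_group(backend="nccl")                # one process per GPU
        local_rank = int(os.environ["LOCAL_RANK"])             # set by torchrun
        torch.cuda.set_device(local_rank)
        model = DDP(model.to(local_rank), device_ids=[local_rank])
        sampler = DistributedSampler(dataset)                  # shards data across ranks
        loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
        return model, loader, sampler

    # In the epoch loop, call sampler.set_epoch(epoch) so shuffling stays consistent
    # across ranks, and dist.destroy_process_group() once training finishes.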

3. Using the Model for Inference

  • user.py (Interactive Streamlit Demo)
    • Purpose: A web-based application for generating text with the trained model; a minimal sketch of such an app follows the steps below.
    • How to Use:
      1. Ensure you have a trained model checkpoint (e.g., weights/mol.pth). The script is pre-configured to look for this file.
      2. Install the required Python packages: pip install -r requirements.txt.
      3. Run the application from your terminal:
        streamlit run user.py
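
For reference, a stripped-down version of such an app could look like the sketch below. The checkpoint path comes from the step above, but the model construction and sampling call are placeholders; user.py contains the real wiring and sampling parameters.

    # Minimal Streamlit text-generation sketch; model loading/generation are placeholders.
    import streamlit as st
    import torch

    st.title("The-Latentformer demo")


    @st.cache_resource
    def load_checkpoint(path: str = "weights/mol.pth"):
        checkpoint = torch.load(path, map_location="cpu")
        # ... build TheTransformer here and load the checkpoint weights into it ...
        return checkpoint


    prompt = st.text_area("Prompt", "Once upon a time")
    temperature = st.slider("Temperature", 0.1, 2.0, 0.8)
    max_new_tokens = st.slider("Max new tokens", 16, 512, 128)

    if st.button("Generate"):
        checkpoint = load_checkpoint()
        with st.spinner("Generating..."):
            # placeholder: call the model's sampling routine with the chosen settings
            text = f"(generated continuation of: {prompt!r})"
        st.write(text)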

License: GNU General Public License v2.0

