LangDist: Distributed Training Framework for Large Language Models

Overview

LangDist is a distributed training framework for Large Language Models (LLMs) built on Ray. It uses Ray's task scheduling to run LLM training in parallel across multiple devices.

Features

Distributed Training: Parallel training across multiple devices.
Ray Integration: Efficient task distribution and execution with Ray.
FastAPI Backend: API for device registration, task submission, and status monitoring.
Model and Dataset Management: Simple upload and management of models and datasets.
React Frontend: User-friendly interface for interacting with the training system.
Flexible Model Support: Compatible with various LLM architectures, including LLaMA.

System Architecture

LangDist is composed of the following components:

Backend API (main.py): FastAPI server for handling device registration, task submission, and status queries.
Ray Setup (ray_setup.py): Configures and initializes the Ray cluster.
Training Module (train.py): Implements distributed training logic using Ray.
Data Loader (dataloader.py): Handles loading and preprocessing of datasets.
Model Loader (model.py): Manages loading and initialization of LLM models.
Frontend (App.tsx): React-based user interface for interacting with the system.

Installation

Clone the repository:

```
git clone https://github.com/yourusername/LangDist.git
cd LangDist
```

Install the required Python packages:
```
pip install -r requirements.txt
```
Install Node.js and npm (for the frontend).
Install frontend dependencies:
```
cd frontend
npm install
```

Usage

Starting the Backend

Navigate to the project root directory.

Run the FastAPI server:

```
uvicorn main:app --host 0.0.0.0 --port 8000
```
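
Once uvicorn is running, one quick way to confirm that something is listening on the port is a plain TCP connection check. The sketch below uses only the Python standard library. The function name `backend_is_up` is illustrative and not part of LangDist; the default port 8000 is taken from the command above.

```python
import socket


def backend_is_up(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper, not part of LangDist. It only checks that the
    port is open, not that the process behind it is the FastAPI app.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

If FastAPI's default settings are unchanged in main.py, the interactive API documentation should also be reachable at http://localhost:8000/docs.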

Starting the Frontend

Navigate to the frontend directory.
Start the React development server:
```
npm start
```

API Endpoints

POST /devices/register: Register a new device for distributed training.
GET /devices: List all registered devices.
POST /tasks/submit: Submit a new training task.
GET /tasks: List all submitted tasks.
GET /tasks/{task_id}/status: Check the status of a specific task.
POST /upload/model/: Upload a custom model file.
POST /upload/dataset/: Upload a custom dataset file.

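The endpoints above can be called from any HTTP client. As a minimal sketch, the helper below builds requests with Python's standard `urllib`. It assumes the non-upload endpoints accept and return JSON, which this README does not confirm. The helper names and the `name` field in the example payload are hypothetical; the actual request schemas are defined in main.py.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port from the uvicorn command in Usage


def build_request(method, path, payload=None, base_url=BASE_URL):
    """Build (but do not send) a request for one of the endpoints.

    Assumption: request bodies are JSON. The upload endpoints most likely
    expect multipart form data instead and are not covered here.
    """
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=method
    )


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical payload: the real field names are defined in main.py.
register = build_request("POST", "/devices/register", {"name": "gpu-node-1"})
list_tasks = build_request("GET", "/tasks")
```

With the backend running, `send(list_tasks)` would return whatever JSON the server produces for GET /tasks.
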
Distributed Training

1. Register available devices using the /devices/register endpoint.
2. Upload your model and dataset using the respective upload endpoints.
3. Submit a training task via the /tasks/submit endpoint, specifying the model, dataset, and devices to use.
4. Monitor the task status using the /tasks/{task_id}/status endpoint.

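The workflow in this section can be scripted. The sketch below covers steps 1, 3, and 4 and takes the HTTP transport as a parameter, so it makes no assumption about which client library is used. The response fields `task_id` and `status`, and the terminal values `completed` and `failed`, are guesses for illustration only; check main.py for the real names.

```python
import time


def run_training(call, device, task, poll_seconds=5.0, max_polls=120):
    """Register a device, submit a task, and poll until it finishes.

    `call(method, path, payload)` is any function that performs the HTTP
    request and returns the decoded JSON response as a dict.
    """
    call("POST", "/devices/register", device)            # step 1
    # Step 2 (model and dataset upload) is omitted: file uploads typically
    # need multipart form data rather than a JSON body.
    submitted = call("POST", "/tasks/submit", task)      # step 3
    task_id = submitted["task_id"]                       # hypothetical field

    for _ in range(max_polls):                           # step 4
        status = call("GET", f"/tasks/{task_id}/status", None)
        if status["status"] in ("completed", "failed"):  # hypothetical values
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

Capping the number of polls keeps a script from waiting forever if a task stalls; both the interval and the cap are arbitrary defaults to adjust for real training runs.
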
Development

Running Tests

Execute the test suite to ensure system integrity:
```
pytest test_main.py test_ray.py
```

sharikjavid / LangDist
