Rahul-Tawar / Seq2Seq-EN_UR


English to Urdu Translation Model

This project implements a sequence-to-sequence model with Bahdanau attention for English to Urdu translation.

Project Structure

  • model.py: Contains the encoder-decoder architecture with attention mechanism
  • data_preparation.py: Handles data loading and preprocessing
  • train.py: Training script for the model
  • config.py: Configuration parameters
  • requirements.txt: Project dependencies

Setup

  1. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  2. Install dependencies:
pip install -r requirements.txt
  3. Download spaCy models:
python -m spacy download en_core_web_sm
python -m spacy download xx_ent_wiki_sm

Data Preparation

The model expects parallel English-Urdu sentence pairs. The data should be organized in the following format:

  • Training data: data/train.en and data/train.ur
  • Validation data: data/val.en and data/val.ur
  • Test data: data/test.en and data/test.ur

Training

To train the model:

python train.py

The model will be saved in the models/ directory when validation loss improves.
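The "save on improvement" logic amounts to tracking the best validation loss seen so far. A minimal sketch of that bookkeeping (the `BestCheckpoint` class is illustrative, not the actual code in `train.py`):

```python
import math

class BestCheckpoint:
    """Track the best validation loss and signal when to save a checkpoint."""

    def __init__(self):
        self.best = math.inf  # no validation pass has run yet

    def update(self, val_loss):
        """Return True iff val_loss improves on the best seen so far."""
        if val_loss < self.best:
            self.best = val_loss
            return True
        return False
```

In the training loop this would gate the save, e.g. `if checkpoint.update(val_loss): torch.save(model.state_dict(), "models/best.pt")`.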

Model Architecture

  • Encoder: Bi-directional LSTM
  • Decoder: LSTM with Bahdanau attention
  • Attention: additive (Bahdanau) scoring over the encoder outputs
  • Embedding: Word embeddings for both languages
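At each decoding step, Bahdanau attention scores every encoder output against the current decoder state with a small feed-forward network, softmaxes the scores, and takes a weighted sum as the context vector. A numpy sketch of just that computation (weight names `W_s`, `W_h`, `v` and all shapes are illustrative, not taken from `model.py`):

```python
import numpy as np

def bahdanau_attention(dec_state, enc_outputs, W_s, W_h, v):
    """Additive attention: score_i = v . tanh(W_s s + W_h h_i).

    dec_state:   (dec_dim,)          current decoder hidden state s
    enc_outputs: (src_len, enc_dim)  encoder outputs h_1..h_n
    Returns (context, weights) with shapes (enc_dim,) and (src_len,).
    """
    # (src_len, attn_dim) via broadcasting, then project to (src_len,)
    scores = np.tanh(dec_state @ W_s.T + enc_outputs @ W_h.T) @ v
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    context = weights @ enc_outputs          # convex combination of h_i
    return context, weights

# Toy dimensions for demonstration only.
rng = np.random.default_rng(0)
dec_dim, enc_dim, attn_dim, src_len = 4, 6, 5, 3
context, weights = bahdanau_attention(
    rng.standard_normal(dec_dim),
    rng.standard_normal((src_len, enc_dim)),
    rng.standard_normal((attn_dim, dec_dim)),
    rng.standard_normal((attn_dim, enc_dim)),
    rng.standard_normal(attn_dim),
)
```

In the real model these matrices would be `nn.Linear` layers and the whole step would run batched in PyTorch, but the arithmetic is the same.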

Hyperparameters

All hyperparameters can be adjusted in config.py:

  • Embedding dimension: 256
  • Hidden dimension: 512
  • Number of layers: 2
  • Dropout: 0.5
  • Batch size: 64
  • Learning rate: 0.001
  • Number of epochs: 20
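The values above would typically live in `config.py` as module-level constants — a hypothetical sketch, since the actual file layout isn't shown here:

```python
# config.py — hypothetical layout mirroring the documented defaults.
EMBEDDING_DIM = 256   # word-embedding size for both languages
HIDDEN_DIM = 512      # LSTM hidden size
NUM_LAYERS = 2        # stacked LSTM layers
DROPOUT = 0.5
BATCH_SIZE = 64
LEARNING_RATE = 1e-3
NUM_EPOCHS = 20
```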

Evaluation

The model uses cross-entropy loss for training and validation. The best model is saved based on validation loss.

Requirements

  • Python 3.7+
  • PyTorch 1.9.0+
  • spaCy
  • torchtext
  • tqdm
  • numpy
  • pandas
