This project implements a sequence-to-sequence model with Bahdanau attention for English to Urdu translation.
- `model.py`: encoder-decoder architecture with the attention mechanism
- `data_preparation.py`: data loading and preprocessing
- `train.py`: training script for the model
- `config.py`: configuration parameters
- `requirements.txt`: project dependencies
- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the spaCy models:

  ```bash
  python -m spacy download en_core_web_sm
  python -m spacy download xx_ent_wiki_sm
  ```
The model expects parallel English-Urdu sentence pairs, organized in the following format:

- Training data: `data/train.en` and `data/train.ur`
- Validation data: `data/val.en` and `data/val.ur`
- Test data: `data/test.en` and `data/test.ur`
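Because the `.en` and `.ur` files are line-aligned, loading a split reduces to zipping the two files line by line. A minimal sketch (the function name `load_parallel` is illustrative, not the actual API in `data_preparation.py`):

```python
from pathlib import Path

def load_parallel(src_path, tgt_path):
    """Read line-aligned source/target files into sentence pairs.

    Pairing is strictly line-by-line, matching the data layout above
    (e.g. data/train.en with data/train.ur).
    """
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must have the same number of lines")
    # Drop pairs where either side is empty.
    return [(s.strip(), t.strip())
            for s, t in zip(src_lines, tgt_lines)
            if s.strip() and t.strip()]
```

Raising on a length mismatch is deliberate: silently truncating to the shorter file would misalign every pair after the first missing line.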
To train the model:

```bash
python train.py
```

The model is saved to the `models/` directory whenever validation loss improves.
- Encoder: Bi-directional LSTM
- Decoder: LSTM with Bahdanau attention
- Attention: Bahdanau attention mechanism
- Embedding: Word embeddings for both languages
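Bahdanau attention scores each encoder state against the current decoder state with a small feed-forward network, then takes a softmax-weighted sum as the context vector. A sketch of the mechanism in PyTorch (class and parameter names are illustrative and may differ from `model.py`; with the bi-directional encoder above, `enc_dim` would be twice the encoder hidden size):

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score(s, h) = v^T tanh(W_s s + W_h h)."""

    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)  # projects encoder states
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)  # projects decoder state
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_h(enc_outputs) + self.W_s(dec_state).unsqueeze(1)
        )).squeeze(-1)                                  # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)         # attention distribution
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights                         # (batch, enc_dim), (batch, src_len)
```

At each decoding step, the context vector is concatenated with the decoder input (or state) before predicting the next Urdu token.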
All hyperparameters can be adjusted in `config.py`:
- Embedding dimension: 256
- Hidden dimension: 512
- Number of layers: 2
- Dropout: 0.5
- Batch size: 64
- Learning rate: 0.001
- Number of epochs: 20
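As a config fragment, the values above might appear in `config.py` like this (the identifier names are illustrative; check `config.py` for the actual ones):

```python
# Model hyperparameters (illustrative names; values match the list above)
EMBEDDING_DIM = 256
HIDDEN_DIM = 512
NUM_LAYERS = 2
DROPOUT = 0.5

# Training hyperparameters
BATCH_SIZE = 64
LEARNING_RATE = 1e-3
NUM_EPOCHS = 20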
The model uses cross-entropy loss for training and validation. The best model is saved based on validation loss.
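The two pieces above can be sketched as follows: cross-entropy over the flattened decoder outputs (ignoring padding positions), plus a checkpoint helper that saves only on improvement. `PAD_IDX`, the function names, and the checkpoint path are assumptions for illustration, not the actual identifiers in `train.py`:

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding index; the real value comes from the vocabulary
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

def sequence_loss(logits, targets):
    """Cross-entropy over a batch of decoder outputs.

    logits: (batch, tgt_len, vocab_size); targets: (batch, tgt_len).
    Positions equal to PAD_IDX are excluded from the average.
    """
    return criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def maybe_save(model, val_loss, best_loss, path="models/best.pt"):
    """Checkpoint the model only when validation loss improves."""
    if val_loss < best_loss:
        torch.save(model.state_dict(), path)
        return val_loss  # new best
    return best_loss
```

Passing `ignore_index=PAD_IDX` matters: without it, easy-to-predict padding tokens would deflate the reported loss on batches of mixed-length sentences.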
- Python 3.7+
- PyTorch 1.9.0+
- spaCy
- torchtext
- tqdm
- numpy
- pandas