This project implements a sequence-to-sequence model with Bahdanau attention for English to Urdu translation.
- `model.py`: encoder-decoder architecture with the attention mechanism
- `data_preparation.py`: data loading and preprocessing
- `train.py`: training script for the model
- `config.py`: configuration parameters
- `requirements.txt`: project dependencies
- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the spaCy models:

  ```bash
  python -m spacy download en_core_web_sm
  python -m spacy download xx_ent_wiki_sm
  ```
The model expects parallel English-Urdu sentence pairs, organized in the following format:

- Training data: `data/train.en` and `data/train.ur`
- Validation data: `data/val.en` and `data/val.ur`
- Test data: `data/test.en` and `data/test.ur`
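Because the `.en` and `.ur` files are line-aligned, loading a split reduces to zipping the two files line by line. A minimal sketch (the function name `load_parallel` is illustrative, not the actual API in `data_preparation.py`):

```python
from pathlib import Path

def load_parallel(src_path, tgt_path):
    """Read line-aligned source/target files into sentence pairs.

    Pairing is strictly line-by-line, matching the data layout above
    (e.g. data/train.en with data/train.ur).
    """
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must have the same number of lines")
    # Drop pairs where either side is empty.
    return [(s.strip(), t.strip())
            for s, t in zip(src_lines, tgt_lines)
            if s.strip() and t.strip()]
```

Raising on a length mismatch is deliberate: silently truncating to the shorter file would misalign every pair after the first missing line.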
To train the model:

```bash
python train.py
```

The model is saved to the `models/` directory whenever validation loss improves.
- Encoder: Bi-directional LSTM
- Decoder: LSTM with Bahdanau attention
- Attention: Bahdanau attention mechanism
- Embedding: Word embeddings for both languages
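Bahdanau attention scores each encoder state against the current decoder state with a small feed-forward network, then takes a softmax-weighted sum as the context vector. A sketch of the mechanism in PyTorch (class and parameter names are illustrative and may differ from `model.py`; with the bi-directional encoder above, `enc_dim` would be twice the encoder hidden size):

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score(s, h) = v^T tanh(W_s s + W_h h)."""

    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)  # projects encoder states
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)  # projects decoder state
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_h(enc_outputs) + self.W_s(dec_state).unsqueeze(1)
        )).squeeze(-1)                                  # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)         # attention distribution
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights                         # (batch, enc_dim), (batch, src_len)
```

At each decoding step, the context vector is concatenated with the decoder input (or state) before predicting the next Urdu token.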
All hyperparameters can be adjusted in `config.py`:
- Embedding dimension: 256
- Hidden dimension: 512
- Number of layers: 2
- Dropout: 0.5
- Batch size: 64
- Learning rate: 0.001
- Number of epochs: 20
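As a config fragment, the values above might appear in `config.py` like this (the identifier names are illustrative; check `config.py` for the actual ones):

```python
# Model hyperparameters (illustrative names; values match the list above)
EMBEDDING_DIM = 256
HIDDEN_DIM = 512
NUM_LAYERS = 2
DROPOUT = 0.5

# Training hyperparameters
BATCH_SIZE = 64
LEARNING_RATE = 1e-3
NUM_EPOCHS = 20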
The model uses cross-entropy loss for training and validation. The best model is saved based on validation loss.
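The two pieces above can be sketched as follows: cross-entropy over the flattened decoder outputs (ignoring padding positions), plus a checkpoint helper that saves only on improvement. `PAD_IDX`, the function names, and the checkpoint path are assumptions for illustration, not the actual identifiers in `train.py`:

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding index; the real value comes from the vocabulary
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

def sequence_loss(logits, targets):
    """Cross-entropy over a batch of decoder outputs.

    logits: (batch, tgt_len, vocab_size); targets: (batch, tgt_len).
    Positions equal to PAD_IDX are excluded from the average.
    """
    return criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def maybe_save(model, val_loss, best_loss, path="models/best.pt"):
    """Checkpoint the model only when validation loss improves."""
    if val_loss < best_loss:
        torch.save(model.state_dict(), path)
        return val_loss  # new best
    return best_loss
```

Passing `ignore_index=PAD_IDX` matters: without it, easy-to-predict padding tokens would deflate the reported loss on batches of mixed-length sentences.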
- Python 3.7+
- PyTorch 1.9.0+
- spaCy
- torchtext
- tqdm
- numpy
- pandas