The provided code is a Jupyter Notebook (ICTMCV2.ipynb) for a sequence-to-sequence (Seq2Seq) model trained for text summarization tasks in Arabic. It uses the Hugging Face transformers library and leverages pre-trained models for Arabic language processing.
Make sure you have the necessary dependencies installed. You can install them using the following commands:
!pip install transformers
!pip install datasets
!pip install arabert
!pip install sentencepiece
!pip install arabicnlp
The code defines several preprocessing functions to clean and prepare the input text for the summarization model. These functions include removing diacritics, extra spaces, repeated characters, quotes, punctuation, and certain special characters. The Sequential function is used to apply a sequence of these preprocessing functions to the input text.
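As a rough illustration of how such a cleaning pipeline can be composed, here is a minimal sketch using only the standard library. The function bodies, the regular expressions, and the way Sequential is implemented here are assumptions made for illustration; the notebook's own definitions may differ.

```python
import re

# Arabic diacritics (tashkeel, U+064B to U+0652) plus the tatweel
# elongation mark (U+0640). This character set is an assumption.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def remove_diacritics(text: str) -> str:
    return _DIACRITICS.sub("", text)


def remove_repeated_chars(text: str) -> str:
    # Collapse a run of three or more identical characters into one.
    return re.sub(r"(.)\1{2,}", r"\1", text)


def remove_extra_spaces(text: str) -> str:
    # Collapse any whitespace run into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def Sequential(*steps):
    """Return a function that applies each cleaning step in order."""
    def apply(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return apply


clean = Sequential(remove_diacritics, remove_repeated_chars, remove_extra_spaces)
```

The order of the steps matters: whitespace is normalized last so that gaps left by earlier removals are also collapsed.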
The script initializes and loads a Seq2Seq model for Arabic text summarization using the transformers library. The model can be configured to use different pre-trained models such as BERT, BART, etc. The selected model is loaded, and a tokenizer is created.
The code loads labeled validation data (labeled_validation_dataset.jsonl) and additional datasets required for training and evaluation.
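A JSONL file holds one JSON object per line, so it can be read with the standard library alone. The sketch below shows one way to do that; the helper name load_jsonl is hypothetical, and nothing is assumed about the field names inside each record.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so bad rows are easy to find.
                raise ValueError(f"{path}: invalid JSON on line {line_no}") from exc
    return records
```

Calling load_jsonl("labeled_validation_dataset.jsonl") would then return the validation records as Python dictionaries.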
The training process involves creating a custom PyTorch dataset (SummarizationDataset) for the training data. The collate_fn function is used to collate batches for training. The script defines functions for training epochs, calculating loss, and evaluating accuracy.
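The dataset and collation logic can be sketched without any dependencies, since a PyTorch DataLoader accepts any object that implements `__len__` and `__getitem__`. Everything below is an assumption about shape rather than the notebook's actual code: the `encode` callable stands in for the tokenizer, the `pad` helper is hypothetical, and the batch is returned as plain lists that would normally be converted to tensors. Padding labels with -100 matches the default ignore index of PyTorch's cross-entropy loss.

```python
class SummarizationDataset:
    """Map-style dataset of (source token ids, summary token ids) pairs."""

    def __init__(self, texts, summaries, encode):
        # `encode` is any callable mapping a string to a list of token ids
        # (a stand-in for the real tokenizer).
        if len(texts) != len(summaries):
            raise ValueError("texts and summaries must have the same length")
        self.texts = texts
        self.summaries = summaries
        self.encode = encode

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {
            "input_ids": self.encode(self.texts[idx]),
            "labels": self.encode(self.summaries[idx]),
        }


def pad(seqs, value):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [list(s) + [value] * (width - len(s)) for s in seqs]


def collate_fn(batch, pad_id=0, label_pad_id=-100):
    """Merge dataset items into one padded batch with an attention mask."""
    inputs = [item["input_ids"] for item in batch]
    labels = [item["labels"] for item in batch]
    return {
        "input_ids": pad(inputs, pad_id),
        "attention_mask": pad([[1] * len(s) for s in inputs], 0),
        "labels": pad(labels, label_pad_id),
    }
```

Padding per batch rather than to a global maximum keeps short batches small, which is the usual reason for supplying a custom collate function.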
The code fine-tunes the loaded model on the training dataset and evaluates its performance on the validation dataset. The training process involves optimizing the model parameters using the AdamW optimizer.
The script implements a custom LoRA (Low-Rank Adaptation) layer, which wraps layers of the original model. In LoRA, the pre-trained weights are typically kept frozen and a small trainable low-rank update is added to them, so only a fraction of the parameters needs to be trained.
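For reference, the standard LoRA formulation for a single linear layer is shown below. Whether the notebook's custom layer follows exactly this form, including the scaling factor, is an assumption.

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Here $W_0$ is the frozen pre-trained weight, $A$ and $B$ are the only trained matrices, and $\alpha$ is a constant scaling hyperparameter. The number of trainable parameters for the layer drops from $dk$ to $r(d + k)$. $B$ is usually initialized to zero, so the adapted layer starts out identical to the original one.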
The code provides functions for generating summaries using the trained model. It uses the generate function to produce summaries for input text.
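A generation helper along these lines is one common pattern with Hugging Face Seq2Seq models. This is a sketch, not the notebook's code: the function name summarize and the default values for max_input_tokens, max_new_tokens, and num_beams are assumptions, and the tokenizer and model are expected to be loaded elsewhere.

```python
def summarize(text, tokenizer, model,
              max_input_tokens=512, max_new_tokens=128, num_beams=4):
    """Generate a summary for one input text.

    `tokenizer` and `model` are expected to be a Hugging Face tokenizer and
    Seq2Seq model that were loaded elsewhere; the defaults are illustrative.
    """
    # Tokenize, truncating inputs that exceed the chosen length limit.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    # Beam search decoding; returns a batch of token id sequences.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )
    # Decode the first (and only) sequence, dropping special tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Passing the tokenizer and model in as arguments keeps the helper independent of which pre-trained checkpoint was selected.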
The script evaluates the model's performance on the validation dataset, calculating ROUGE scores and cosine similarity scores between generated summaries and ground truth summaries.
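To make the two metrics concrete, here is a simplified, dependency-free sketch. It computes ROUGE-1 F1 from unigram overlap and cosine similarity over bag-of-words counts, both on whitespace tokens. These are stand-ins chosen for illustration: the notebook may use a ROUGE library and compute cosine similarity over embeddings or another vector representation instead.

```python
import math
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Both functions return a value between 0 and 1, where higher means the generated summary is closer to the ground truth summary.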
The final section of the code presents an analysis of the generated summaries, including ROUGE scores, cosine similarity scores, and comparisons between generated and ground truth summaries.
The trained model and its parameters can be saved using the save_pretrained method.
This code serves as a comprehensive solution for training, fine-tuning, and evaluating a Seq2Seq model for Arabic text summarization using pre-trained transformer models.