janhavitayde / BART-CNN-dailymail


Title: Text Summarization using BART Model

1. Introduction:

Text summarization is a vital task in natural language processing (NLP) that aims to condense large amounts of text into shorter summaries while preserving the key information. In this report, we present our approach to text summarization using the BART (Bidirectional and Auto-Regressive Transformers) model. We will discuss the techniques employed, challenges faced during implementation, and evaluate the model's performance using ROUGE scores. Additionally, we will explore potential applications of text summarization models like BART.

2. Approach:

i. Dataset: We used the CNN/DailyMail dataset, which pairs news articles with human-written summaries.
ii. Data Preprocessing: We cleaned the text by removing non-alphanumeric characters, lowercasing, mapping contractions to their expanded forms, tokenizing, removing stop words, and lemmatizing.
iii. BART Model: We employed the pre-trained "facebook/bart-large-cnn" checkpoint, which is fine-tuned for summarization.
iv. BART Tokenizer: We used the BART tokenizer to prepare inputs for the model, truncating text to the model's maximum input length.
v. Text Summarization: We wrote a function that generates summaries with the BART model using beam search, a length penalty, and early stopping (see the generation sketch in Section 3).
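
A rough sketch of steps i–iv, assuming the Hugging Face `datasets` and `transformers` packages (the notebook's exact preprocessing may differ; the classical cleaning steps are omitted here since the BART tokenizer operates on raw text):

```python
from datasets import load_dataset
from transformers import BartTokenizer, BartForConditionalGeneration

# Load CNN/DailyMail: each example pairs an "article" with "highlights"
# (the human-written reference summary).
dataset = load_dataset("cnn_dailymail", "3.0.0")

# Pre-trained BART checkpoint fine-tuned on CNN/DailyMail summaries.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Tokenize one article, truncating to BART's 1024-token input window.
article = dataset["test"][0]["article"]
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
```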

3. Techniques Used:

  • Data Preprocessing: We cleaned the text data by removing noise and irrelevant information, ensuring better quality input for the model.
  • BART Model: We leveraged the power of pre-trained transformer-based models like BART, which have been trained on extensive amounts of data and have demonstrated state-of-the-art performance in various NLP tasks.
  • BART Tokenizer: The tokenizer converted the input text into the token-ID format that the BART model expects.
  • Beam Search: We used beam search to explore multiple candidate summaries and select the one with the highest likelihood under the model.
  • Length Penalty: We applied a length penalty when scoring candidate summaries during generation, balancing brevity against informativeness.
  • Early Stopping: We enabled early stopping so that generation halts once beam search has found enough complete candidate summaries, rather than always decoding to the maximum length (see the sketch below).
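
A minimal sketch of how these three settings map onto the Hugging Face `generate` API; the specific values shown mirror the generation defaults published with "facebook/bart-large-cnn" and are assumptions about, not a record of, the notebook's exact configuration:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def summarize(text: str) -> str:
    # Truncate the article to BART's 1024-token input window.
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,          # beam search: keep the 4 best partial hypotheses
        length_penalty=2.0,   # exponent on length used when re-scoring finished beams
        min_length=56,        # lower bound on summary length, in tokens
        max_length=142,       # upper bound on summary length, in tokens
        early_stopping=True,  # halt beam search once enough complete candidates exist
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```
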
4. Model Performance:

To evaluate the BART model on summarization, we used the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric. ROUGE measures the n-gram overlap between a generated summary and its reference summary, reporting recall, precision, and F1-score.

Our model achieved the following ROUGE scores on a sample summary:

  • ROUGE-1 (unigram overlap): 0.57
  • ROUGE-2 (bigram overlap): 0.40
  • ROUGE-L (longest common subsequence): 0.51
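
For reference, a minimal sketch of computing such scores with the `rouge-score` package; the reference and generated strings below are hypothetical placeholders, not examples from our dataset:

```python
from rouge_score import rouge_scorer

# Hypothetical placeholders; in practice the reference comes from the
# CNN/DailyMail "highlights" field and the candidate from the model.
reference = "The council approved the new city budget on Tuesday."
generated = "The new city budget was approved by the council."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)  # signature: score(target, prediction)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```
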
5. Potential Applications:

i. News Aggregation: Summarizing news articles helps users quickly grasp the main points.
ii. Document Summarization: Automatic summarization of long documents can produce executive summaries, saving readers time.
iii. Information Extraction: Summarization can help extract key information from lengthy documents such as legal texts or scientific papers.
iv. Chatbots and Virtual Assistants: Summarization models can enhance chatbots and virtual assistants by generating concise responses to user queries.

Languages: Jupyter Notebook (100.0%)