
Abstractive Text Summarization

Dataset

The dataset used for this project is the CNN/Daily Mail summarization dataset. On Linux/Unix you can download and extract it with:

wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
tar -xzvf cnn_dm.tgz

Windows

  • Download the cnn_dm.tgz archive from the same link used above.
  • Extract the contents of the tar.gz file into the data\ directory.
  • After extraction you should see train.source, train.target, test.source, test.target, val.source and val.target files in your data directory.
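
To verify the extraction, you can read one article/summary pair: each line of a .source file is an article, and the same line of the matching .target file is its reference summary. A minimal sketch, assuming the files live under data/:

# Read the first article and its reference summary from the parallel files.
with open("data/train.source", encoding="utf-8") as src, \
     open("data/train.target", encoding="utf-8") as tgt:
    article, summary = next(src), next(tgt)
print(article[:200])
print(summary[:200])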

Transfer Learning

  • Transfer learning reuses knowledge learned on one task to solve a different task.
  • A model is pretrained on a large corpus and its knowledge is transferred to downstream tasks.
  • New tasks can then be solved by fine-tuning the pretrained model instead of training from scratch.
We fine-tuned the pretrained T5 and BART transformer models for the text summarization task on the CNN/Daily Mail dataset and achieved reasonably good results. In addition, we built our own transformer model with self-attention layers; this model requires a longer training duration and a larger pretraining dataset to reach comparable performance.
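
A minimal fine-tuning sketch with the Hugging Face transformers library is shown below. The checkpoint name (t5-small), learning rate, sequence lengths, and the tiny data subset are illustrative assumptions, not the exact settings used in this repo.

# Minimal T5 fine-tuning sketch on the extracted CNN/Daily Mail files (illustrative settings).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

with open("data/train.source") as src, open("data/train.target") as tgt:
    articles, summaries = src.readlines(), tgt.readlines()

model.train()
for article, summary in zip(articles[:100], summaries[:100]):  # tiny subset for illustration
    # T5 is trained with a task prefix; the reference summary is passed as labels.
    inputs = tokenizer("summarize: " + article, return_tensors="pt",
                       truncation=True, max_length=512)
    labels = tokenizer(summary, return_tensors="pt",
                       truncation=True, max_length=150).input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()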

T5 Transformer

  • T5 stands for Text-to-Text Transfer Transformer.
  • Proposed by Google AI's Colin Raffel et al.
  • Three architecture variants are considered:
      • Encoder-decoder transformer
      • Language model transformer (autoregressive)
      • Prefix language model transformer
  • Pretrained with a fill-in-the-blank denoising objective.
  • Useful for many downstream tasks.
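
Because T5 casts every problem as text-to-text, the same model and generate call handle different tasks just by changing the input prefix. A small illustration; the prefixes follow the T5 paper, and the t5-small checkpoint is only an example:

# Same model, different tasks, selected purely by the text prefix.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

for prompt in [
    "summarize: The quick brown fox jumped over the lazy dog near the river bank ...",
    "translate English to German: The house is wonderful.",
]:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))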

BART Transformer

  • Proposed by Facebook AI's Mike Lewis et al.
  • BART = BERT + GPT.
  • Uses a bidirectional encoder (as in BERT) and an autoregressive decoder (as in GPT).
  • Uses the GeLU activation function instead of ReLU.
  • In our experiments it yields more semantically sensible summaries than the T5 transformer.
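
For reference, this is roughly how summaries can be generated with a BART checkpoint already fine-tuned on CNN/Daily Mail (facebook/bart-large-cnn); the beam-search parameters below are illustrative, not necessarily the ones used in this repo.

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "Scientists have discovered a new species of deep-sea fish off the coast of ..."
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
# Beam search tends to give more fluent summaries than greedy decoding.
summary_ids = model.generate(inputs.input_ids, num_beams=4,
                             max_length=142, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))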

Novel Transformer

  • 6 attention heads.
  • 3 encoder and 3 decoder layers.
  • Pretrained 300-dimensional FastText embeddings in the embedding layer.
  • Positional encoding to provide continuous positional context.
  • Hidden-state dimensionality of 300.
  • Attention (look-ahead) and padding masks as needed.
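
A rough architectural sketch of such a model in PyTorch is given below; the class and variable names, the FastText loading step, and the maximum sequence length are assumptions for illustration, not the repo's actual code.

import math
import torch
import torch.nn as nn

class SummarizationTransformer(nn.Module):
    def __init__(self, vocab_size, fasttext_weights=None, d_model=300, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        if fasttext_weights is not None:  # pretrained 300-d FastText vectors
            self.embed.weight.data.copy_(fasttext_weights)
        # Fixed sinusoidal positional encodings added to the token embeddings.
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # 6 attention heads, 3 encoder and 3 decoder layers, hidden size 300.
        self.transformer = nn.Transformer(d_model=d_model, nhead=6,
                                          num_encoder_layers=3, num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_pad_mask=None, tgt_pad_mask=None):
        src = self.embed(src) + self.pe[: src.size(1)]
        tgt = self.embed(tgt) + self.pe[: tgt.size(1)]
        # Causal (look-ahead) mask for the decoder plus optional padding masks.
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal,
                                  src_key_padding_mask=src_pad_mask,
                                  tgt_key_padding_mask=tgt_pad_mask)
        return self.out(hidden)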

Languages

Jupyter Notebook 86.0%, Python 14.0%