AshishRajIITI / BERT_summarization_1

A tutorial for first-time BERT users.


The pre-print article is out!

Please visit https://arxiv.org/abs/2006.01997 and suggest any changes you would like to see. I thank our co-authors/collaborators Bowen Tan and Yiming Niu from Rockefeller University.


command line interface

I added more user-friendly command-line pre-processing/training/summarization code for GPT-2: GPT2_preprocessing.py, trainGPT2.py, and GPT2_summarizer.py. To use them, you first need Huggingface's transformers package and a folder where you want to save your fine-tuned model. For the training and validation datasets, refer to the notebook pre-processing-text-for-GPT2-fine-tuning. (Update on Aug 21 2020)

setting up the environment

To install from the Pipfile

pipenv install

or to install from the requirements.txt

pip install -r requirements.txt

pre-processing

The code is in GPT2_preprocessing.py and helperGPT2.py. The pre-processing takes as input the metadata.csv file provided in the COVID-19 Open Research Dataset:

python GPT2_preprocessing.py --input=metadata.csv

The output is a PyTorch TensorDataset.
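For orientation, here is a minimal sketch of what this step amounts to, assuming the abstracts in metadata.csv are tokenized with the GPT-2 tokenizer and packed into a TensorDataset. The column name, truncation length, and output path are illustrative, not necessarily those used in GPT2_preprocessing.py.

```python
import pandas as pd
import torch
from torch.utils.data import TensorDataset
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default

df = pd.read_csv("metadata.csv")
texts = df["abstract"].dropna().tolist()           # assumed column name

# Tokenize, truncate/pad to a fixed length, and stack into tensors
enc = tokenizer(texts, truncation=True, max_length=512,
                padding="max_length", return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"])
torch.save(dataset, "gpt2_train_dataset.pt")       # assumed output file name
```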

Training the GPT2

Use the PyTorch TensorDataset to train GPT-2 (preferably separating the training and validation datasets however you like):

mkdir fine_tuned_folder

python train_command_line.py --epochs=1 --train_data='insert-your-training-data-here' --val_data='insert-your-validation-data-here' --model_name='fine_tuned_folder'

You'll need a GPU and CUDA to train GPT-2; 100 iterations took me about 44 seconds on a single Nvidia Tesla P100.
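For orientation, here is a minimal sketch of the kind of fine-tuning loop the training script runs: a causal language-modeling loss on batches drawn from the saved TensorDataset. The paths, split sizes, and hyperparameters are illustrative assumptions, not the script's defaults.

```python
import torch
from torch.utils.data import DataLoader, random_split
from transformers import GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
dataset = torch.load("gpt2_train_dataset.pt")                          # TensorDataset from pre-processing
train_set, val_set = random_split(dataset, [len(dataset) - 500, 500])  # illustrative split
loader = DataLoader(train_set, batch_size=2, shuffle=True)

model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for input_ids, attention_mask in loader:
    input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
    # For causal LM fine-tuning the labels are the input ids themselves
    # (padding positions would ideally be masked out with -100)
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("fine_tuned_folder")   # writes pytorch_model.bin and config.json
```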

Generating the summary

To generate a summary, give the input as a .txt file and the model directory where pytorch_model.bin and the config files are kept:

python GPT2_summarize.py --input_file='input.txt' --model_directory='insert-your-model-directory-here'
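Under the hood, generating the summary amounts to loading the fine-tuned weights and calling the model's generate method on the tokenized input text. The sketch below is an assumed, simplified version of that flow; the prompt handling and decoding parameters are illustrative, not necessarily those used in GPT2_summarize.py.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_dir = "fine_tuned_folder"                    # folder with pytorch_model.bin + config files
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained(model_dir).eval()

text = open("input.txt").read()
input_ids = tokenizer.encode(text, return_tensors="pt", truncation=True, max_length=900)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=input_ids.shape[1] + 120,       # generate up to ~120 new tokens
        do_sample=True, top_k=50, top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )

# Keep only the newly generated tokens as the summary
print(tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True))
```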


Directory

notebook

This folder contains Colab notebooks that guide you through summarization with BERT and GPT-2. You should be able to open them in Google Colab and play with your own data. The text .csv file, the post-processed training tensor file, and the fine-tuned model weights are available upon request.

Both the BERT and GPT-2 models are implemented with the Transformers library by Huggingface. Descriptions of each notebook are listed below. The citations and related works are in the "generate-summary-with-BERT-or-GPT2" notebook.

Primer-to-BERT-extractive-summarization

A tutorial for beginners and first-time BERT users. It is also a text summarization tool that uses a BERT encoder and a topic-clustering approach. This is what's called "extractive summarization": key sentences containing crucial information are extracted from the paragraph.
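As a rough illustration of this extractive approach (not the exact pipeline in the notebook), one can embed each sentence with BERT, cluster the embeddings, and keep the sentence closest to each cluster centre:

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def embed(sentences):
    """Mean-pool BERT's last hidden state as a simple sentence embedding."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state        # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

def extractive_summary(sentences, n_sentences=3):
    """Pick the sentence nearest each KMeans cluster centre, in original order."""
    emb = embed(sentences)
    km = KMeans(n_clusters=n_sentences, n_init=10).fit(emb)
    picked = {int(np.argmin(np.linalg.norm(emb - c, axis=1))) for c in km.cluster_centers_}
    return " ".join(sentences[i] for i in sorted(picked))
```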

As a companion to BERT, I added GPT-2 summarization. This is so-called "abstractive summarization": I fine-tuned the already pre-trained GPT-2 for the specific summarization task.

training_BERT_keyword_extraction

This notebook serves as auxiliary work on keyword extraction. In it, I fine-tuned BERT for token classification on the POS-tagging task and used it for keyword extraction.
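For context, fine-tuning BERT for token classification follows the standard Huggingface pattern sketched below; the tag set, example data, and label alignment are illustrative assumptions rather than the notebook's exact setup.

```python
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

NUM_TAGS = 17  # illustrative, e.g. the universal POS tag set
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=NUM_TAGS)

# One toy example: words with their integer-encoded POS tags
words, tags = ["BERT", "extracts", "keywords"], [0, 1, 0]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level tags to word-piece tokens; special tokens get the ignore index -100
labels = torch.tensor([[-100 if i is None else tags[i] for i in enc.word_ids()]])

outputs = model(**enc, labels=labels)   # outputs.loss is the token-classification loss
outputs.loss.backward()
```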

pre-processing_raw_text_for_GPT2_summarizer

This notebook guides you through the pre-processing that turns the text data into tokenized tensors ready for training. The raw data can be obtained from this website.

fine-tune-GPT2-for-summarization

I use Ignite, a PyTorch-based library that helps keep track of training. The data comes from the pre-processing step in the previous notebook.
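As a sketch of how Ignite fits in (the notebook's actual handlers, metrics, and data flow may differ), the update step is wrapped in an Engine and event handlers are attached for periodic logging:

```python
import torch
from ignite.engine import Engine, Events
from transformers import GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def update(engine, batch):
    """One training step: forward pass with LM labels, backward, optimizer step."""
    model.train()
    input_ids, attention_mask = (t.to(device) for t in batch)
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

trainer = Engine(update)

@trainer.on(Events.ITERATION_COMPLETED(every=100))
def log_loss(engine):
    print(f"iter {engine.state.iteration}: loss {engine.state.output:.4f}")

# trainer.run(train_loader, max_epochs=1)  # train_loader: DataLoader over the TensorDataset
```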

generate-summary-with-BERT-or-GPT2

I figured out how to train the GPT-2 model to a reasonable outcome. This notebook summarizes how the data is processed from text format to a tokenized query for summarization.
