Implementation of summarization baselines. For the moment there are:

- Random: Select n sentences randomly.
- Lead: Select the first n sentences.
- LexRank: Compute the similarity between sentences and select the n highest-ranked sentences using a PageRank-style algorithm (a toy sketch is shown after this list).
- Bart: Transformer model. Implementation thanks to Hugging Face.
- T5: Transformer model. Implementation thanks to Hugging Face.

More baselines to come.
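As a rough illustration of the LexRank idea, a minimal sketch could look like the following. This is not the repository implementation (real LexRank uses TF-IDF cosine similarity and a thresholded similarity graph); it only shows the "rank sentences with a PageRank-style iteration, keep the top n" principle.

```python
import numpy as np


def lexrank_sketch(sentences, num_sentences=3, damping=0.85, iterations=50):
    """Toy LexRank: PageRank-style power iteration over a sentence similarity matrix."""
    # word-overlap similarity; the real algorithm uses TF-IDF cosine similarity
    bags = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and bags[i] and bags[j]:
                sim[i, j] = len(bags[i] & bags[j]) / np.sqrt(len(bags[i]) * len(bags[j]))
    # row-normalise into a transition matrix (uniform row if a sentence has no neighbours)
    rowsum = sim.sum(axis=1, keepdims=True)
    trans = np.where(rowsum > 0, sim / np.where(rowsum == 0, 1.0, rowsum), 1.0 / n)
    # power iteration, i.e. the PageRank-style ranking step
    scores = np.ones(n) / n
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * trans.T @ scores
    top = sorted(np.argsort(scores)[::-1][:num_sentences])
    return [sentences[i] for i in top]
```

Called on a list of sentences, it returns the `num_sentences` top-ranked sentences in document order.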
```bash
git clone https://github.com/airKlizz/SummarizationBaselines
cd SummarizationBaselines
wget https://raw.githubusercontent.com/airKlizz/MultiDocMultiLingualSum/master/en_wiki_multi_news.py
```
There is no `requirements.txt` file yet, so install dependencies on demand 🤗.
To run a baseline, you first have to configure the `run_args.json` file with your parameters.
This repository is based on the `nlp` library to load data and to compute the ROUGE metric.

The idea is that you have a summarization dataset (an `nlp.Dataset`) with at least one column containing the texts to summarize (`document_column_name`) and one column containing the reference summaries (`summary_colunm_name`). You then run multiple baselines on it and compare the ROUGE results of these different summarization methods.
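For instance, you can sanity-check a dataset with the `nlp` library before running any baseline. The script name, split, and column names below are taken from the example configuration further down and may differ in your setup:

```python
import nlp

# load the dataset from the script downloaded during installation
dataset = nlp.load_dataset(
    "en_wiki_multi_news.py",
    split="test",
    cache_dir=".en-wiki-multi-news-cache",
)

print(dataset.column_names)          # should include "document" and "summary"
print(dataset[0]["document"][:200])  # beginning of the first text to summarize
```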
You can add your own summarization model (extractive or abstractive) as a new baseline to compare its performance with the other baselines. See below for more details on how to add a baseline.
Once you have all the baselines you need and your dataset, you can configure the `run_args.json` file.

This is an example of a `run_args.json` file:
```json
{
  "baselines": [
    {"baseline_class": "Random", "init_kwargs": {"name": "Random"}, "run_kwargs": {"num_sentences": 10}},
    {"baseline_class": "Lead", "init_kwargs": {"name": "Lead"}, "run_kwargs": {"num_sentences": 10}},
    {"baseline_class": "LexRank", "init_kwargs": {"name": "LexRank"}, "run_kwargs": {"num_sentences": 10, "threshold": 0.03, "increase_power": true}},
    {
      "baseline_class": "Bart",
      "init_kwargs": {
        "name": "Bart CNN",
        "model_name": "bart-large-cnn",
        "input_max_length": 512,
        "device": "cuda",
        "batch_size": 8
      },
      "run_kwargs": {
        "num_beams": 4,
        "length_penalty": 2.0,
        "max_length": 400,
        "min_length": 200,
        "no_repeat_ngram_size": 3,
        "early_stopping": true
      }
    },
    {
      "baseline_class": "T5",
      "init_kwargs": {
        "name": "T5 base",
        "model_name": "t5-base",
        "input_max_length": 512,
        "device": "cuda",
        "batch_size": 8
      },
      "run_kwargs": {
        "num_beams": 4,
        "length_penalty": 2.0,
        "max_length": 400,
        "min_length": 200,
        "no_repeat_ngram_size": 3,
        "early_stopping": true
      }
    },
    {
      "baseline_class": "T5",
      "init_kwargs": {
        "name": "T5 fine tuned",
        "model_name": ["t5-base", "/content/drive/My Drive/Colab Notebooks/Multi-wiki-news/English/t5-wild-glitter-2"],
        "input_max_length": 512,
        "device": "cuda",
        "batch_size": 8
      },
      "run_kwargs": {
        "num_beams": 4,
        "length_penalty": 2.0,
        "max_length": 400,
        "min_length": 200,
        "no_repeat_ngram_size": 3,
        "early_stopping": true
      }
    }
  ],
  "dataset": {
    "name": "en_wiki_multi_news_cleaned.py",
    "split": "test",
    "cache_dir": ".en-wiki-multi-news-cache",
    "document_column_name": "document",
    "summary_colunm_name": "summary"
  },
  "run": {
    "hypotheses_folder": "hypotheses/",
    "csv_file": "results.csv",
    "md_file": "results.md",
    "rouge_types": {
      "rouge1": ["mid.fmeasure"],
      "rouge2": ["mid.fmeasure"],
      "rougeL": ["mid.fmeasure"]
    }
  }
}
```
The file is composed of 3 arguments:

- `baselines`: defines all the baselines you want to compare. For each one you give the associated `baseline_class`, the `init_kwargs`, which are arguments passed to the `__init__` function of the class, and the `run_kwargs`, which are arguments passed to the `run` function.
- `dataset`: defines the dataset's arguments: the `name`, which is the name of the `nlp` dataset or the path to the dataset python script, the `split` and the `cache_dir` of the dataset (see the `nlp` `load_dataset` function), `document_column_name`, which is the name of the column in the dataset containing the texts to summarize, and `summary_column_name`, which is the name of the column in the dataset containing the reference summaries.
- `run`: defines the ROUGE run arguments: the `hypotheses_folder` where hypotheses are saved, the optional `csv_file` and `md_file` to save results in the corresponding formats, and `rouge_types`, which are the types of ROUGE scores to compute (see the `nlp` `rouge` metric).
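For context on the `rouge_types` values: the scores come from the `nlp` ROUGE metric, where each ROUGE type is an aggregate with `low`/`mid`/`high` bounds, and each bound exposes precision, recall, and f-measure. A quick check (assuming the `nlp` metric API of the time) looks like this:

```python
import nlp

rouge = nlp.load_metric("rouge")
scores = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
    rouge_types=["rouge1", "rouge2", "rougeL"],
)

# "mid.fmeasure" in run_args.json selects the mid aggregate's f-measure
print(scores["rouge1"].mid.fmeasure)
```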
Once the file is configured, you can launch the computation by running:

```bash
python run_baseline.py
```

Results are stored in the files/folder you specified in the `run_args.json` file.
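If you configured a `csv_file`, a quick way to inspect the scores afterwards is something like the following (pandas is not a declared dependency of the repository, so this is just a convenience sketch using the path from the example configuration):

```python
import pandas as pd

# read the results table written by run_baseline.py (path from run_args.json)
results = pd.read_csv("results.csv")
print(results)
```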
If you want to add your own baseline, you have to create a script similar to `baselines/lead.py` for an extractive baseline or `baselines/bart.py` for an abstractive baseline, containing a subclass of `Baseline` that defines either the function `def rank_sentences(self, dataset, document_column_name, **kwargs)` or `def get_summaries(self, dataset, document_column_name, **kwargs)`.

For an extractive baseline, the function `rank_sentences` ranks all sentences of each document and adds the scores and sentences to a new column of the dataset. It returns the dataset.

For an abstractive baseline, the function `get_summaries` summarizes each document and adds the summaries (also called hypotheses) to a new column of the dataset. It returns the dataset.

Then just register your baseline in the `baselines/baselines.py` file by adding an `if` case, and you can use your baseline.
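As an illustration, a new extractive baseline might look like the sketch below. The import path, column names, and sentence splitting are assumptions made for the example; mirror `baselines/lead.py` for the exact interface the repository expects.

```python
from baselines.baseline import Baseline  # assumed import path, check baselines/lead.py


class FirstSentences(Baseline):
    """Toy extractive baseline: score sentences by position (earlier = higher)."""

    def rank_sentences(self, dataset, document_column_name, **kwargs):
        def score(example):
            # naive sentence split; a real baseline would use a proper sentence tokenizer
            sentences = [s.strip() for s in example[document_column_name].split(".") if s.strip()]
            example["sentences"] = sentences
            example["scores"] = [1.0 / (i + 1) for i in range(len(sentences))]
            return example

        # add sentences and their scores as new columns, then return the dataset
        return dataset.map(score)
```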
| | rouge1 P | rouge1 R | rouge1 F | rouge2 P | rouge2 R | rouge2 F | rougeL P | rougeL R | rougeL F |
|---|---|---|---|---|---|---|---|---|---|
| LexRank and T5 fine-tuned w/ LexRank | 46.54% | 39.30% | 40.28% | 16.09% | 13.27% | 13.73% | 23.20% | 20.03% | 20.23% |
| LexRank and T5 fine-tuned w LexRank | 47.31% | 37.86% | 39.82% | 16.55% | 12.89% | 13.73% | 24.24% | 19.58% | 20.42% |
| Bart CNN | 46.66% | 34.55% | 37.48% | 15.31% | 10.94% | 12.02% | 23.08% | 17.25% | 18.54% |
| T5 base | 30.91% | 20.17% | 22.90% | 9.30% | 5.96% | 6.82% | 17.55% | 11.40% | 12.93% |
| T5 fine tuned | 48.36% | 39.90% | 41.39% | 19.10% | 15.42% | 16.15% | 26.40% | 21.88% | 22.59% |
| T5 with title fine tuned | 49.30% | 40.81% | 42.15% | 19.87% | 16.16% | 16.81% | 27.09% | 22.52% | 23.14% |

| | rouge1 P | rouge1 R | rouge1 F | rouge2 P | rouge2 R | rouge2 F | rougeL P | rougeL R | rougeL F |
|---|---|---|---|---|---|---|---|---|---|
| Random | 38.84% | 45.03% | 39.13% | 12.81% | 14.77% | 12.85% | 17.13% | 20.45% | 17.42% |
| Lead | 40.56% | 49.35% | 41.85% | 15.62% | 18.78% | 16.01% | 20.91% | 25.75% | 21.65% |
| LexRank | 34.05% | 53.30% | 39.40% | 12.63% | 19.83% | 14.62% | 16.04% | 26.00% | 18.79% |
| TextRank | 34.60% | 51.97% | 39.31% | 12.55% | 19.04% | 14.33% | 16.24% | 25.20% | 18.65% |
| TF-IDF | 36.72% | 50.81% | 40.28% | 12.75% | 18.06% | 14.09% | 17.01% | 24.46% | 18.91% |
| Rouge Oracle | 50.48% | 56.48% | 49.54% | 28.52% | 29.93% | 27.07% | 22.68% | 25.83% | 22.29% |