summarization deep-learning machine-learning nlp pytorch

summarus

Abstractive and extractive summarization models, mostly for the Russian language. Built on top of AllenNLP.

You can also check out the mBART-based Russian summarization model on Hugging Face: mbart_ru_sum_gazeta

Based on the following papers:

Contacts

Telegram: @YallenGusev

Prerequisites

pip install -r requirements.txt

Commands

train.sh

Script for training a model, based on the AllenNLP 'train' command.

Argument	Required	Description
-c	true	path to file with configuration
-s	true	path to directory where model will be saved
-t	true	path to train dataset
-v	true	path to val dataset
-r	false	recover from checkpoint

predict.sh

Script for model evaluation. The test dataset should have the same format as the train dataset.

Argument	Required	Default	Description
-t	true		path to test dataset
-m	true		path to tar.gz archive with model
-p	true		name of Predictor
-c	false	0	CUDA device
-L	true		Language ("ru" or "en")
-b	false	32	size of a batch with test examples to run simultaneously
-M	false		path to meteor.jar for Meteor metric
-T	false		tokenize gold and predicted summaries before metrics calculation
-D	false		save temporary files with gold and predicted summaries
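
The flags in the table above can also be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name build_predict_command and its keyword names are hypothetical, and it assumes that -T and -D are value-less switches (the prediction commands in this README pass -T that way) and that the script is run from the repository root.

```python
import shlex
from typing import List, Optional


def build_predict_command(
    test_path: str,
    model_path: str,
    predictor: str,
    language: str,
    cuda_device: int = 0,
    batch_size: int = 32,
    meteor_jar: Optional[str] = None,
    tokenize: bool = False,
    save_temp_files: bool = False,
    script: str = "./predict.sh",
) -> List[str]:
    """Assemble a predict.sh argument list from the documented flags (hypothetical helper)."""
    if language not in ("ru", "en"):
        raise ValueError("language must be 'ru' or 'en'")
    cmd = [
        script,
        "-t", test_path,          # path to test dataset
        "-m", model_path,         # path to tar.gz archive with model
        "-p", predictor,          # name of Predictor
        "-L", language,           # "ru" or "en"
        "-c", str(cuda_device),   # CUDA device, documented default 0
        "-b", str(batch_size),    # batch size, documented default 32
    ]
    if meteor_jar is not None:
        cmd += ["-M", meteor_jar]  # path to meteor.jar
    if tokenize:
        cmd.append("-T")           # assumed to be a switch without a value
    if save_temp_files:
        cmd.append("-D")           # assumed to be a switch without a value
    return cmd


cmd = build_predict_command(
    "test.jsonl", "gazeta_pgn_7kk.tar.gz", "subwords_summary", "ru", tokenize=True
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps paths with spaces intact when the command is passed to subprocess.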

summarus.util.train_subword_model

Script for subword model training.

Argument	Default	Description
--train-path		path to train dataset
--model-path		path to directory where generated subword model will be saved
--model-type	bpe	type of subword model, see sentencepiece
--vocab-size	50000	size of the resulting subword model vocabulary
--config-path		path to file with configuration for DatasetReader (with parse_set)

Headline generation

First paper: Importance of Copying Mechanism for News Headline Generation
Slides: Importance of Copying Mechanism for News Headline Generation
Second paper: Advances of Transformer-Based Models for News Headline Generation

Dataset splits:

RIA original dataset: https://github.com/RossiyaSegodnya/ria_news_dataset
RIA train/val/test: https://www.dropbox.com/s/rermx1r8lx9u7nl/ria.tar.gz
RIA dataset preprocessed for mBART: https://www.dropbox.com/s/iq2ih8sztygvz0m/ria_data_mbart_512_200.tar.gz
Lenta original dataset: https://github.com/yutkin/Lenta.Ru-News-Dataset
Lenta train/val/test: https://www.dropbox.com/s/v9i2nh12a4deuqj/lenta.tar.gz
Lenta dataset preprocessed for mBART: https://www.dropbox.com/s/4oo8jazmw3izqvr/lenta_mbart_data_512_200.tar.gz
Telegram train dataset with split: https://www.dropbox.com/s/ykqk49a8avlmnaf/ru_all_split.tar.gz
Telegram test dataset with multiple references: https://github.com/dialogue-evaluation/Russian-News-Clustering-and-Headline-Generation/blob/main/data/headline_generation/headline_generation_answers.jsonl

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru

Results

Train dataset: RIA, test dataset: RIA

Model	R-1-f	R-2-f	R-L-f	BLEU
ria_copynet_10kk	40.0	23.3	37.5	-
ria_pgn_24kk	42.3	25.1	39.6	-
ria_mbart	42.8	25.5	39.9	-
First Sentence	24.1	10.6	16.7	-
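
For readers unfamiliar with the column names: R-1-f and R-2-f are ROUGE-N F-measures over unigrams and bigrams, reported in the tables as percentages. The sketch below illustrates the metric only. It is not the evaluation code used by predict.sh, and the lowercasing and whitespace tokenization are simplifying assumptions.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f(reference: str, hypothesis: str, n: int = 1) -> float:
    """ROUGE-N F-measure for a single reference/hypothesis pair, in [0, 1]."""
    ref = ngrams(reference.lower().split(), n)
    hyp = ngrams(hypothesis.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


gold = "the cat sat on the mat"
pred = "the cat lay on the mat"
print(round(rouge_n_f(gold, pred, n=1), 3))  # 0.833
print(round(rouge_n_f(gold, pred, n=2), 3))  # 0.6
```

Multiply by 100 to compare with the scale used in the tables.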

Train dataset: RIA, eval dataset: Lenta

Model	R-1-f	R-2-f	R-L-f	BLEU
ria_copynet_10kk	25.6	12.3	23.0	-
ria_pgn_24kk	26.4	12.3	24.0	-
ria_mbart	30.3	14.5	27.1	-
First Sentence	25.5	11.2	19.2	-
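
The "First Sentence" rows are a baseline that uses the first sentence of the article as the headline. This README does not say how sentences were segmented, so the regex splitter below is only an assumption for illustration; it will mis-split on abbreviations and initials.

```python
import re


def first_sentence(text: str) -> str:
    """Return the text up to and including the first sentence terminator."""
    # Split once at whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    return parts[0]


article = "Moscow hosted the forum. Delegates arrived on Monday."
print(first_sentence(article))  # Moscow hosted the forum.
```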

Summarization - CNN/DailyMail

Dataset splits:

CNN/DailyMail jsonl dataset: https://www.dropbox.com/s/35ezpg78rtukkgh/cnn_dm_jsonl.tar.gz
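
A JSONL file holds one JSON object per line, so the dataset can be streamed without loading it all into memory. The sketch below shows a generic reader. The field names "text" and "summary" are assumptions made for the example, so check the downloaded files for the actual keys.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for an opened dataset file; the keys are hypothetical.
sample = io.StringIO(
    '{"text": "Article body.", "summary": "Short summary."}\n'
    '{"text": "Another article.", "summary": "Another summary."}\n'
)
records = list(read_jsonl(sample))
print(len(records), records[0]["summary"])
```

For a real file, pass open(path, encoding="utf-8") instead of the StringIO object.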

Models:

cnndm_pgn_25kk

Prediction script:

./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R

Results:

Model	R-1-f	R-2-f	R-L-f	METEOR	BLEU
cnndm_pgn_25kk	38.5	16.5	33.4	17.6	-

Summarization - Gazeta, Russian news dataset

Paper: Dataset for Automatic Summarization of Russian News
Gazeta dataset: https://github.com/IlyaGusev/gazeta
Usage examples:

Models:

Prediction scripts:

./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T

External models:

Results:

Model	R-1-f	R-2-f	R-L-f	METEOR	BLEU
gazeta_pgn_7kk	29.4	12.7	24.6	21.2	9.0
gazeta_pgn_7kk_cov	29.8	12.8	25.4	22.1	10.1
gazeta_pgn_25kk	29.6	12.8	24.6	21.5	9.3
gazeta_pgn_words_13kk	29.4	12.6	24.4	20.9	8.9
gazeta_summarunner_3kk	31.6	13.7	27.1	26.0	11.5
gazeta_mbart	32.6	14.6	28.2	25.7	12.4
gazeta_mbart_lower	32.7	14.7	28.3	25.8	12.5

Demo

python demo/server.py --include-package summarus --model-dir <model_dir> --host <host> --port <port>

Citations

Headline generation (PGN):

@article{Gusev2019headlines,
    author={Gusev, I.O.},
    title={Importance of copying mechanism for news headline generation},
    journal={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
    year={2019},
    volume={2019-May},
    number={18},
    pages={229--236}
}

Headline generation (transformers):

@InProceedings{Bukhtiyarov2020headlines,
    author={Bukhtiyarov, Alexey and Gusev, Ilya},
    title="Advances of Transformer-Based Models for News Headline Generation",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages={54--61},
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_4}
}

Summarization:

@InProceedings{Gusev2020gazeta,
    author="Gusev, Ilya",
    title="Dataset for Automatic Summarization of Russian News",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages="122--134",
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_9}
}

About

Models for automatic abstractive summarization


Apache License 2.0

Languages

Python 80.7%, Jsonnet 13.7%, Shell 3.5%, Gherkin 2.0%