nikolayVv / MultiParaphrase

Comparing and evaluating monolingual paraphrasing of English, German, Czech, and Slovene sentences, along with multilingual paraphrasing across these languages.


Multilingual paraphrasing of sentences

This project compares and evaluates monolingual paraphrasing of English, German, Czech, and Slovene sentences, along with multilingual paraphrasing across these languages. We generate monolingual datasets, rate them through human evaluation, and estimate their quality by comparing their scores with those of existing monolingual datasets. The models are evaluated with the Parascore metric, which lets us analyse how effective each model is for each language. In this way we identify the advantages and disadvantages of using a mono- or multilingual dataset when training for multilingual paraphrasing of sentences.

Data

This repository does not collect any new data. Instead, we build on existing resources. We start from the ParaCrawl dataset, which contains a large number of parallel sentences in different languages, and use machine translation models from Hugging Face to create paraphrase data from this translation dataset. While other multilingual parallel datasets include sentence pairs within a language (i.e. paraphrases), they contain few, if any, such pairs for medium-resource languages like Slovene. Because translation data is more widely available than paraphrase data, our approach yields similarly sized paraphrase datasets for different languages, including medium-resource ones.
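One way to read this step, as a minimal sketch: for each parallel pair, the foreign-language side is machine-translated into the target language and the result is paired with the original target sentence. The exact pairing procedure is not specified here, so the function name, the `translate` callable, and the toy dictionary translator below are all hypothetical; in practice `translate` would wrap a machine translation model.

```python
from typing import Callable, Iterable, List, Tuple


def build_paraphrase_pairs(
    parallel_pairs: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn (target_sentence, foreign_sentence) pairs into paraphrase pairs.

    Assumption: the foreign side is machine-translated into the target
    language, and the translation is paired with the original target sentence.
    """
    paraphrases = []
    for target_sentence, foreign_sentence in parallel_pairs:
        candidate = translate(foreign_sentence).strip()
        # Skip empty outputs and exact copies, which carry no paraphrase signal.
        if candidate and candidate.lower() != target_sentence.strip().lower():
            paraphrases.append((target_sentence, candidate))
    return paraphrases


# Hypothetical stand-in for a real translation model, for illustration only.
TOY_DE_EN = {
    "Das Wetter ist heute schön.": "The weather is beautiful today.",
    "Ich mag Bücher.": "I like books.",
}


def toy_translate(sentence: str) -> str:
    return TOY_DE_EN.get(sentence, "")


parallel = [
    ("The weather is nice today.", "Das Wetter ist heute schön."),
    ("I like books.", "Ich mag Bücher."),
]
pairs = build_paraphrase_pairs(parallel, toy_translate)
```

Passing the translator in as a callable keeps the pairing logic independent of any particular model, so the same function could be reused for each of the four languages.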

Our generated data can be accessed on Hugging Face:

Dataset Evaluation

We evaluate the quality of our monolingual datasets through human evaluation of a sample from each dataset and in direct comparison with other popular paraphrase datasets. We evaluate semantic similarity and lexical divergence and calculate a score based on their combination. The human evaluation results for the four generated monolingual datasets are shown in the following table:

Language    Our dataset    Tatoeba
en-en       0.256          0.307
de-de       0.291          0.588
sl-sl       0.271          0.015
cs-cs       0.189          0.210
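The formula behind the combined score is not given above, so the following is only a sketch of how such a score could be computed from two human ratings per sentence pair. It assumes both ratings are scaled to [0, 1] and uses their product as the combination; the function names, the product rule, and the sample ratings are hypothetical and are not necessarily what produced the table.

```python
from statistics import mean
from typing import Sequence, Tuple


def combined_score(similarity: float, divergence: float) -> float:
    """Combine two ratings in [0, 1]; the product rule is an assumption."""
    if not (0.0 <= similarity <= 1.0 and 0.0 <= divergence <= 1.0):
        raise ValueError("ratings must lie in [0, 1]")
    # A pair scores high only if it keeps the meaning AND changes the wording.
    return similarity * divergence


def dataset_score(ratings: Sequence[Tuple[float, float]]) -> float:
    """Average the combined score over a rated sample of sentence pairs."""
    return mean(combined_score(s, d) for s, d in ratings)


# Made-up ratings for three pairs: (semantic similarity, lexical divergence).
sample = [(1.0, 0.5), (0.5, 0.5), (1.0, 0.0)]
score = dataset_score(sample)
```

With a product rule, an exact copy of the input (high similarity, zero divergence) contributes nothing to the dataset score, which matches the intuition that a good paraphrase must both preserve meaning and change wording.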

Models

We train six different mT5 models, one for each of the datasets we have created. We refer to these models as mono- and multilingual models, even though the underlying mT5 models are multilingual, because we train them on the generated mono- and multilingual datasets.

Our trained models can be accessed on Hugging Face:

Model Evaluation

We use the Parascore metric to evaluate all models. The Parascore results for the four monolingually trained models are shown in the following table:

Language    Parascore
en-en       0.961
de-de       0.925
sl-sl       0.890
cs-cs       0.922

The Parascore results for the two multilingually trained models are shown in the following table, which also lists the average score for each of the four language subsets of the multilingual dataset's test split:

Language          Score multi-small    Score multi-all
whole test set    0.925                0.925
English part      0.938                0.939
German part       0.926                0.925
Slovene part      0.915                0.914
Czech part        0.922                0.922
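The per-language rows above are averages over subsets of the test split. A minimal sketch of that aggregation, assuming each test example carries a language tag and a per-example metric score; the function name, the tags, and the scores below are made up for illustration:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def average_by_language(scored: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average per-example scores per language and over the whole set."""
    by_language = defaultdict(list)
    for language, score in scored:
        by_language[language].append(score)
    averages = {lang: mean(scores) for lang, scores in by_language.items()}
    # The overall figure averages all examples, not the per-language means,
    # so larger subsets weigh more.
    averages["whole test set"] = mean(
        s for scores in by_language.values() for s in scores
    )
    return averages


# Hypothetical (language, score) pairs standing in for real metric output.
scored_examples = [("en", 0.5), ("en", 1.0), ("de", 0.25), ("sl", 0.75)]
averages = average_by_language(scored_examples)
```

Averaging over all examples rather than over the per-language means is a design choice in this sketch; the two coincide only when the language subsets are equally sized.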

Authors

  • Nikolay Vasilev
  • Jannik Weiß (YAWNICK)
  • Jan Jenicek (hjeni)

About

License: Apache License 2.0


Languages

  • Jupyter Notebook 99.4%
  • Python 0.6%