
RuCoLA

RuCoLA logo

The Russian Corpus of Linguistic Acceptability (RuCoLA) is a dataset consisting of Russian language sentences with their binary acceptability judgements. It includes expert-written sentences from linguistic publications and machine-generated examples.

The corpus covers a variety of language phenomena, ranging from syntax and semantics to generative model hallucinations. We release RuCoLA to facilitate the development of methods for identifying errors in natural language, and we host a public leaderboard to track progress on this problem.

The dataset is available in the data/ folder of the repository.
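
For a quick look at the data, the splits can be loaded with pandas. Below is a minimal sketch; the file and column names are assumptions, so check the data/ folder for the exact schema.

```python
# Minimal sketch for inspecting a RuCoLA split with pandas.
# The file name and the label column are assumptions; see data/ for
# the actual schema.
import pandas as pd

train = pd.read_csv("data/in_domain_train.csv")  # hypothetical file name
print(train.head())
print(train["acceptable"].value_counts())  # hypothetical binary label column
```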

Read more about RuCoLA in the corresponding blog post (in Russian).

Sources

Linguistic publications and resources
| Original source | Transliterated source | Source id |
|---|---|---|
| Проект корпусного описания русской грамматики | Proekt korpusnogo opisaniya russkoj grammatiki | Rusgram |
| Тестелец, Я.Г., 2001. Введение в общий синтаксис. Федеральное государственное бюджетное образовательное учреждение высшего образования Российский государственный гуманитарный университет. | Yakov Testelets. 2001. Vvedeniye v obschiy sintaksis. Russian State University for the Humanities. | Testelets |
| Лютикова, Е.А., 2010. К вопросу о категориальном статусе именных групп в русском языке. Вестник Московского университета. Серия 9. Филология, (6), pp. 36-76. | Ekaterina Lutikova. 2010. K voprosu o kategorial’nom statuse imennykh grup v russkom yazyke. Moscow University Philology Bulletin. | Lutikova |
| Митренина, О.В., Романова, Е.Е. and Слюсарь, Н.А., 2017. Введение в генеративную грамматику. Общество с ограниченной ответственностью "Книжный дом ЛИБРОКОМ". | Olga Mitrenina et al. 2017. Vvedeniye v generativnuyu grammatiku. Limited Liability Company “LIBROCOM”. | Mitrenina |
| Падучева, Е.В., 2004. Динамические модели в семантике лексики. М.: Языки славянской культуры. | Elena Paducheva. 2004. Dinamicheskiye modeli v semantike leksiki. Languages of Slavonic culture. | Paducheva2004 |
| Падучева, Е.В., 2010. Семантические исследования: Семантика времени и вида в русском языке; Семантика нарратива. М.: Языки славянской культуры. | Elena Paducheva. 2010. Semanticheskiye issledovaniya: Semantika vremeni i vida v russkom yazyke; Semantika narrativa. Languages of Slavonic culture. | Paducheva2010 |
| Падучева, Е.В., 2013. Русское отрицательное предложение. М.: Языки славянской культуры. | Elena Paducheva. 2013. Russkoye otritsatel’noye predlozheniye. Languages of Slavonic culture. | Paducheva2013 |
| Селиверстова, О.Н., 2004. Труды по семантике. М.: Языки славянской культуры. | Olga Seliverstova. 2004. Trudy po semantike. Languages of Slavonic culture. | Seliverstova |
| Набор данных ЕГЭ по русскому языку | Shavrina et al. 2020. Humans Keep It One Hundred: an Overview of AI Journey | USE5, USE7, USE8 |
Machine-generated sentences

Datasets

| Original source | Source id |
|---|---|
| Mikel Artetxe and Holger Schwenk. 2019. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | Tatoeba |
| Holger Schwenk et al. 2021. WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia | WikiMatrix |
| Ye Qi et al. 2018. When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation? | TED |
| Alexandra Antonova and Alexey Misyurev. 2011. Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text | YandexCorpus |

Models

EasyNMT models:

  1. OPUS-MT. Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – Building open translation services for the World
  2. M-BART50. Yuqing Tang et al. 2020. Multilingual Translation with Extensible Multilingual Pretraining and Finetuning
  3. M2M-100. Angela Fan et al. 2021. Beyond English-Centric Multilingual Machine Translation

Paraphrase generation models:

  1. ruGPT2-Large
  2. ruT5
  3. mT5. Linting Xue et al. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
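
As an illustration of how translation-based examples of this kind can be generated, here is a minimal sketch using the EasyNMT wrapper around OPUS-MT. The source sentences are illustrative, and this is not the exact pipeline used to build the corpus.

```python
# Minimal sketch: producing machine-translated Russian sentences with
# EasyNMT and OPUS-MT. Illustrative only; not the exact RuCoLA pipeline.
from easynmt import EasyNMT

model = EasyNMT("opus-mt")  # alternatives: "mbart50_m2m", "m2m_100_418M"
sources = [
    "The cat sat on the mat.",
    "Where did you put the keys?",
]
translations = model.translate(sources, source_lang="en", target_lang="ru")
for src, tgt in zip(sources, translations):
    print(f"{src} -> {tgt}")
```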

Baselines

Requirements

To set up the environment, you will need to install recent versions of PyTorch, scikit-learn, and Transformers. The requirements.txt file contains the full list of top-level packages used to train all models below.
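
For example, to install everything in one step:

pip install -r requirements.txt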

Models

Our code supports acceptability classification with both scikit-learn models (a majority-class baseline and logistic regression) and Hugging Face 🤗 Transformers models (masked language models such as ruBERT and XLM-R, as well as ruT5).
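
For reference, here is a minimal sketch of what acceptability classification with a Transformers checkpoint looks like. This is not the repository's training code; the model name is an assumption, and the classification head is untrained until finetuned on RuCoLA.

```python
# Minimal sketch of acceptability classification with a Hugging Face
# checkpoint. Not the repository's finetune_mlm.py; the model name is an
# assumption, and the freshly initialized classification head produces
# meaningless scores until finetuned on RuCoLA.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "ai-forever/ruBert-base"  # hypothetical choice of Russian MLM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

batch = tokenizer(["Мама мыла раму."], return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.softmax(dim=-1))  # per-class probabilities after finetuning
```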

Running experiments

To train acceptability classifiers on RuCoLA, run the following commands.

  • scikit-learn models:
python baselines/train_sklearn_baselines.py -m {majority, logreg}
  • Masked Language Model (e.g. ruBERT or XLM-R) finetuning:
python baselines/finetune_mlm.py -m [MODEL_NAME]
  • ruT5 finetuning:
python baselines/finetune_t5.py -m [MODEL_NAME]

Afterwards, you can export test set predictions for all trained models in the format required by the leaderboard. To do this, run python baselines/get_csv_predictions.py -m MODEL1 MODEL2 ....
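
Before submitting, it can be useful to sanity-check predictions locally against the development data. Below is a minimal sketch with scikit-learn, using accuracy and the Matthews correlation coefficient (the metric popularized by the English CoLA benchmark); the file and column names are assumptions.

```python
# Hypothetical local evaluation sketch: compares development-set
# predictions with gold labels. File and column names are assumptions.
import pandas as pd
from sklearn.metrics import accuracy_score, matthews_corrcoef

gold = pd.read_csv("data/in_domain_dev.csv")     # hypothetical path
pred = pd.read_csv("predictions/dev_preds.csv")  # hypothetical path
y_true, y_pred = gold["acceptable"], pred["acceptable"]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```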

Contact us

For any questions about RuCoLA or this website, write to contact@rucola-benchmark.com or create an issue in this repository.

You can also join the official Telegram chat to discuss your ideas with others and follow the latest project updates.

Cite

If you use RuCoLA in your research, please cite the dataset using the following entry:

@dataset{vladislav_mikhailov_2022_6560847,
  author       = {Vladislav Mikhailov and
                  Tatiana Shamardina and
                  Max Ryabinin and
                  Alena Pestova and
                  Ivan Smurov and
                  Ekaterina Artemova},
  title        = {RuCoLA benchmark},
  month        = may,
  year         = 2022,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.6560847},
  url          = {https://doi.org/10.5281/zenodo.6560847}
}

License

Our baseline code and acceptability labels are available under the Apache 2.0 license. The copyright (where applicable) of texts from the linguistic publications and resources remains with the original authors or publishers.
