VirtualRoyalty / UD_Russian-SynTagRus

Russian data from the SynTagRus corpus.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

# Summary

Russian data from the SynTagRus corpus.

# Introduction

The SynTagRus dependency treebank is being developed by the Computational
Linguistics Laboratory, A.A.Kharkevich Institute of Information Transmission
Problems, Russian Academy of Sciences, located in Moscow.

Currently the treebank contains over 1,000,000 tokens (over 66,000 sentences)
belonging to texts from a variety of genres (contemporary fiction, popular
science, newspaper and journal articles dated between 1960 and 2016, texts of
online news etc.)

SynTagRus is a human-corrected corpus of Russian supplied
with comprehensive morphological annotation and syntactic annotation in the
form of a complete dependency tree provided for every sentence. Additionally,
the original version of SynTagRus contains other types of annotation, first of
all lexical functional annotation in terms of lexical functions as defined
in the Meaning-Text model.

It is an integral but fully autonomous part of the Russian National Corpus
developed in a nationwide research project and can be freely consulted on the
Web: http://www.ruscorpora.ru/instruction-syntax.html

For more details, see the recently published paper (in Russian):

Дяченко П.В., Иомдин Л.Л., Лазурский А.В., Митюшин Л.Г., Подлесская О.Ю.,
Сизов В.Г., Фролова Т.И., Цинман Л.Л. Современное состояние глубоко
аннотированного корпуса текстов русского языка (СинТагРус) // Сборник
«Национальный корпус русского языка: 10 лет проекту». Труды Института русского
языка им. В.В. Виноградова. М., 2015. Вып. 6. С. 272-299.

## References

* Droganova, K., Lyashevskaya, O., & Zeman, D. (2018). 
Data Conversion and Consistency of Monolingual Corpora: Russian UD Treebanks. 
In Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), 
December 13–14, 2018, Oslo University, Norway (No. 155, pp. 52-65). Linköping University Electronic Press.


#Changelog
*2019-05-15 v2.4
  * enhanced representation fixed ('бы')
  * AUX for aux 'бы'

*2018-11-15 v2.3
  * Rules for punctuation fixed
  * True case lammes for PROPN
  * advmod/discource distinction
  * aux for бы (fixed some issues) 

*2018-04-15 v2.2
  * Rules for punctuation implemented
  * Rules for reported speech implemented
  * Passives fixed
  * PROPN distinguishing from NOUN improved
  * MWE fixed (underscored lemmas)

*2017-11-15 v2.1
  * Conversion rules for syntax completely rewritten
  * PROPN distinguishing from NOUN improved
  * csubj added
  * Elliptic constructions fixed
  * MWE fixed

*2017-03-15 v2.0
  * Converted to UD v2 guidelines.
  * Elliptic constructions added.
  * Compounds added.

*2016-11-15 v1.4
  * Fixed peculiar Latin/Cyrillic encoding errors.
  * Lemmas are now lowercased as in other treebanks.
  * PROPN distinguished from NOUN, using heuristics based on upper/lowercase.
  * Added "foreign" dependencies.


=== Machine-readable metadata =================================================
Data available since: UD v1.3
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: news nonfiction fiction
Lemmas: converted from manual
UPOS: converted from manual
XPOS: not available
Features: converted from manual
Relations: converted from manual
Contributors: Droganova, Kira; Lyashevskaya, Olga; Zeman, Daniel
Contributing: elsewhere
Contact: zeman@ufal.mff.cuni.cz, droganova@ufal.mff.cuni.cz
===============================================================================
Data contributors: Droganova, Kira; Lyashevskaya, Olga; Zeman, Daniel
Documentation contributors: Shakurova, Lena; Mustafina, Nina

About

Russian data from the SynTagRus corpus.

License:Other