asivokon / awesome-ukrainian-nlp

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

awesome-ukrainian-nlp

1. Datasets / Corpora

Monolingual

  • Brown-UK — carefully curated corpus of the modern Ukrainian language with disambiguated tokens, 1 million words
  • UberText 2.0 — over 5 GB of news, Wikipedia, social, fiction, and legal texts
  • Wikipedia
  • OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28GB deduplicated.
  • CC-100 — documents extracted from Common Crawl, automatically classified and filtered. Ukrainian part is 200M sentences or 10GB of deduplicated text.
  • mC4 — another filtered Common Crawl corpus, 196GB of Ukrainian text.
  • Ukrainian Twitter corpus - Ukrainian Twitter corpus for toxic text detection.
  • Ukrainian forums — 250k sentences scraped from forums.
  • Ukrainian news headlines — 5.2M news headlines.
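
Several of the corpora above are described as deduplicated. As a rough illustration of what exact line-level deduplication involves, here is a minimal Python sketch using only the standard library. It is not the pipeline used by OSCAR, CC-100, or mC4; the `normalize` and `deduplicate` helpers and the sample sentences are hypothetical, and real pipelines typically add language identification and quality filtering on top.

```python
import hashlib
import unicodedata


def normalize(line: str) -> str:
    """NFC-normalize, collapse whitespace, and lowercase a line."""
    return " ".join(unicodedata.normalize("NFC", line).split()).lower()


def deduplicate(lines):
    """Yield each line whose normalized form has not been seen before."""
    seen = set()
    for line in lines:
        key = normalize(line)
        if not key:
            continue  # skip blank lines
        # Store a fixed-size digest instead of the full text to save memory.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield line.strip()


# Hypothetical sample: the second line differs only in case and spacing.
sample = [
    "Київ є столицею України.",
    "київ  є столицею україни.",
    "",
    "Львів розташований на заході країни.",
]
unique = list(deduplicate(sample))
```

Because `deduplicate` is a generator, it can be fed a file object line by line, although the set of digests still grows with the number of distinct lines.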

Parallel

See Helsinki-NLP/UkrainianLT for more data and links to machine translation resources.

Labeled

Dictionaries

2. Tools

  • tree_stem — stemmer
  • pymorphy2 + pymorphy2-dicts-uk — POS tagger and lemmatizer
  • LanguageTool — grammar, style and spell checker
  • Stanza — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
  • nlp-uk — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
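
As an example of how one of these tools might be used, the sketch below runs Stanza's Ukrainian pipeline for lemmatization and POS tagging. It assumes `stanza` is installed and that its default Ukrainian models can be downloaded; the helper names `build_pipeline` and `lemmas_and_pos` are hypothetical and not part of Stanza.

```python
def build_pipeline():
    """Create a Ukrainian Stanza pipeline (downloads models on first use)."""
    import stanza  # third-party package: pip install stanza

    stanza.download("uk")  # fetch the default Ukrainian models
    return stanza.Pipeline("uk", processors="tokenize,mwt,pos,lemma")


def lemmas_and_pos(nlp, text):
    """Return (surface form, lemma, universal POS tag) for every word."""
    doc = nlp(text)
    return [
        (word.text, word.lemma, word.upos)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

Calling `lemmas_and_pos(build_pipeline(), "Я люблю читати книжки.")` should return one triple per word. The pipeline is passed in as an argument so it can be built once and reused across many texts.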

3. Pretrained models

Language models

Masked:

Autoregressive:

  • pythia-uk — Pythia fine-tuned on wiki and oasst1 for chats in Ukrainian.
  • UAlpaca — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
  • XGLM — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian.
  • Tereveni-AI/GPT-2
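
To show how an autoregressive model like those above might be used for text continuation, here is a minimal sketch with the Hugging Face `transformers` library. It assumes `transformers`, PyTorch, and `sentencepiece` are installed, and that the XGLM checkpoint is published under the id `facebook/xglm-4.5B`; the helper names `load_causal_lm` and `continue_text` are hypothetical. Any other causal LM id could be substituted.

```python
def load_causal_lm(model_name="facebook/xglm-4.5B"):
    """Load a tokenizer and causal LM; the default model id is an assumption."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model


def continue_text(tokenizer, model, prompt, max_new_tokens=40):
    """Generate a continuation of `prompt` and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # generate() returns a batch; decode the first (and only) sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `continue_text(*load_causal_lm(), "Київ відомий своїми")` downloads several gigabytes of weights on first use, so a smaller checkpoint may be preferable for quick experiments.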

Mixed:

Machine translation

See Helsinki-NLP/UkrainianLT for more.

Sequence-to-sequence models

Named-entity recognition (NER)

Part-of-speech tagging (POS)

Word embeddings

Other

4. Paid

5. Other resources and links

6. Workshops and conferences
