SlavicaJ / spacy-serbian-pipeline

A pipeline for creating a language model for Serbian in spaCy

Serbian Language Pipeline for spaCy

Work in progress. Far from production ready.

How to use with spaCy?

...

Data files

To test training, we use the UD (Universal Dependencies) dataset, which has been automatically converted to Cyrillic. This is temporary; we will eventually use our own training data.
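The conversion script itself is not shown here, so the following is only a minimal sketch of how a Latin-to-Cyrillic conversion for Serbian could work. The function name `latin_to_cyrillic` and the mapping tables are hypothetical and not taken from this repository. The sketch greedily treats `lj`, `nj` and `dž` as single letters, which is wrong for the few words where those sequences span a morpheme boundary (for example `injekcija`), so a real conversion would need an exception list.

```python
# Hypothetical helper, not the repository's converter: a greedy
# Serbian Latin -> Cyrillic transliteration sketch.

# Digraphs that map to a single Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# Remaining one-to-one letter mappings.
SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}


def latin_to_cyrillic(text: str) -> str:
    """Transliterate Serbian Latin text to Cyrillic, keeping letter case."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        # Try the two-letter digraphs first.
        if len(pair) == 2 and pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            # Digits, punctuation and unknown characters pass through.
            out.append(ch)
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)
```

When applied to a CoNLL-U file, only the word form and lemma columns would presumably be converted, leaving tags and dependency labels untouched.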

Lemmatizer data

  • data originates from Morpho-SLaWS (Tasovac, Rudan and Rudan 2015) and Transpoetika (Tasovac 2012)
  • currently includes both Ekavian and Jekavian forms; I may move the Jekavian forms to the normalization function
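As a rough illustration of what moving Jekavian forms into a normalization step might look like, here is a minimal sketch that maps Jekavian word forms to their Ekavian equivalents before lemma lookup. The table `JEKAVIAN_TO_EKAVIAN`, its three sample entries, and the function `normalize_token` are assumptions made for this example; they are not the project's data or API, and the sketch does not show how such a function would be registered with spaCy.

```python
# Hypothetical sketch: normalize Jekavian forms to Ekavian with a
# lookup table, so the lemmatizer data only needs Ekavian entries.

# Illustrative entries only; a real table would come from the lexicon.
JEKAVIAN_TO_EKAVIAN = {
    "млијеко": "млеко",
    "дијете": "дете",
    "пјесма": "песма",
}


def normalize_token(token: str) -> str:
    """Return the Ekavian form of a known Jekavian token, else the token."""
    ekavian = JEKAVIAN_TO_EKAVIAN.get(token.lower())
    if ekavian is None:
        return token
    # Preserve an initial capital (sentence-initial words, titles).
    if token[:1].isupper():
        return ekavian[:1].upper() + ekavian[1:]
    return ekavian
```

A lookup table is used here rather than a rewrite rule because the Jekavian reflexes (`ије`, `је`) also occur in words that are identical in both variants, so a blanket string substitution would corrupt them.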

About

License: GNU General Public License v3.0


Languages

Python 98.1%, Shell 1.5%, Awk 0.4%