esborisova / LangDetec

🗣️ LangDetec

A pipeline for training a machine-learning classifier that detects the language of a document.

Dataset

European Parliament Proceedings Parallel Corpus 1996-2011 (Koehn 2005).

The data corpus can be downloaded here.

Approach

Multiclass supervised classification based on TF-IDF-weighted character n-grams.
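To make the feature representation concrete, here is a minimal standard-library sketch of TF-IDF weighting over character n-grams. It is an illustration only, not the repository's code: the function names `char_ngrams` and `tfidf_vectors`, the n-gram size of 3, the smoothed IDF formula, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Compute one sparse TF-IDF vector (dict: n-gram -> weight) per document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: the number of documents each n-gram occurs in.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vec = {}
        for gram, k in c.items():
            tf = k / total  # relative frequency within this document
            # Smoothed IDF (an assumed variant): n-grams found in every
            # document keep a small positive weight instead of zero.
            idf = math.log((1 + n_docs) / (1 + df[gram])) + 1
            vec[gram] = tf * idf
        vectors.append(vec)
    return vectors


# Toy documents, invented for illustration (not taken from Europarl).
docs = ["the parliament", "das parlament", "parlamentet"]
vectors = tfidf_vectors(docs, n=3)

# Show the five highest-weighted trigrams of the first document.
for gram, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1])[:5]:
    print(repr(gram), round(weight, 3))
```

In this sketch, n-grams shared by all documents (such as `par`) receive a lower weight than n-grams specific to one document (such as `the`), which is what makes the representation useful for telling languages apart.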

Train and test corpus

The following languages were selected:

🇬🇧 English ('en')

🇩🇰 Danish ('da')

🇩🇪 German ('de')

🇸🇪 Swedish ('sv')

🇮🇹 Italian ('it')

ML algorithm

Multinomial Naive Bayes classifier.
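As a rough illustration of how a multinomial Naive Bayes classifier can separate the five languages, the sketch below implements one from scratch over character trigrams. It is not the repository's implementation: the class name `NaiveBayesLangID`, the Laplace smoothing parameter `alpha`, and the one-sentence-per-language training set are assumptions, and raw n-gram counts are used in place of TF-IDF weights to keep the example short.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesLangID:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n          # n-gram size
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)         # documents per class
        self.gram_counts = defaultdict(Counter)   # n-gram counts per class
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(char_ngrams(text, self.n))
        self.vocab = {g for c in self.gram_counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.gram_counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.class_docs:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.class_docs[label] / n_docs)
            denom = self.totals[label] + self.alpha * vocab_size
            for gram in char_ngrams(text, self.n):
                if gram in self.vocab:  # skip n-grams never seen in training
                    count = self.gram_counts[label][gram]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set, invented for illustration (one sentence per language).
train = [
    ("the members of the european parliament voted on the report", "en"),
    ("medlemmerne af europa-parlamentet stemte om betænkningen", "da"),
    ("die mitglieder des europäischen parlaments stimmten über den bericht ab", "de"),
    ("ledamöterna i europaparlamentet röstade om betänkandet", "sv"),
    ("i membri del parlamento europeo hanno votato sulla relazione", "it"),
]
texts, labels = zip(*train)
model = NaiveBayesLangID(n=3).fit(texts, labels)

print(model.predict("the report of the parliament"))
print(model.predict("das parlament stimmte ab"))
```

A real training run would use many documents per language; with a single sentence per class the model only demonstrates the mechanics of the prior, the smoothed likelihoods, and the argmax over classes.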

References

Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of Machine Translation Summit X: Papers, 79–86. 13–15 September, Phuket.
