Pipeline for training a machine learning classifier that detects the language of a document.
European Parliament Proceedings Parallel Corpus 1996-2011 (Koehn 2005).
The data corpus can be downloaded here.
Multiclass supervised classification based on TF-IDF weighted character n-grams.
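As a rough illustration of the feature extraction step, the sketch below builds TF-IDF weighted character n-gram vectors using only the Python standard library. The n-gram range (1 to 3), the lowercasing, the smoothed idf formula, and the L2 normalisation are assumptions chosen for the example, not settings taken from this project. The helper names char_ngrams and tfidf_vectors are hypothetical. In practice a library vectorizer such as scikit-learn's TfidfVectorizer with analyzer="char" would typically do this job.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max (lowercased)."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(docs, n_min=1, n_max=3):
    """Map each document to a dict {ngram: weight}, L2-normalised."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Assumed smoothed idf: ln((1 + N) / (1 + df)) + 1
        vec = {
            g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


# Toy documents, made up for the example (not taken from the corpus).
docs = ["the house is red", "das Haus ist rot", "huset er rødt"]
vectors = tfidf_vectors(docs)

# Show the five highest-weighted n-grams of the first document.
top = sorted(vectors[0].items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top)
```

N-grams that occur in only one document (such as "th" in the English sentence) receive a higher idf than n-grams shared by all documents (such as a single space), which is what makes them useful for telling languages apart.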
The following languages were selected:
🇬🇧 English ('en')
🇩🇰 Danish ('da')
🇩🇪 German ('de')
🇸🇪 Swedish ('sv')
🇮🇹 Italian ('it').
Multinomial Naive Bayes classifier.
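To make the classification step concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing, written in plain Python. It is not this project's implementation. For brevity it uses raw character bigram counts instead of TF-IDF weights. The class name SimpleMultinomialNB, the smoothing value alpha=1.0, and the five training sentences are assumptions invented for the example. A library classifier such as scikit-learn's MultinomialNB would normally be used instead.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class SimpleMultinomialNB:
    """Hypothetical minimal multinomial Naive Bayes over n-gram counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.feature_counts.values() for g in c}
        return self

    def predict(self, text):
        # N-grams never seen during training are ignored.
        grams = [g for g in char_ngrams(text) if g in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            score += sum(
                math.log((counts[g] + self.alpha) / denom) for g in grams
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, made up for the example (not taken from the corpus).
train = [
    ("the weather is nice today and the sun is shining", "en"),
    ("das Wetter ist heute schön und die Sonne scheint", "de"),
    ("vejret er dejligt i dag og solen skinner", "da"),
    ("vädret är fint idag och solen skiner", "sv"),
    ("il tempo è bello oggi e il sole splende", "it"),
]
texts, labels = zip(*train)
model = SimpleMultinomialNB().fit(texts, labels)

print(model.predict("the sun is nice"))
print(model.predict("die Sonne ist schön"))
```

With only one sentence per language the model is a toy. Closely related languages such as Danish and Swedish need far more training text to be separated reliably.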
Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of Machine Translation Summit X: Papers. 79–86. 13–15 September. Phuket.