plur: Pre-trained Language Models for Under-represented Languages

This repository contains pre-trained language models for under-represented languages in NLP.

Language models are available for Flair and ELMo (XLNet coming soon). All trained language models are evaluated with Flair on NER and PoS tagging downstream tasks.

Basque

Corpus

Flair Embeddings and ELMo are trained on a recent Wikipedia dump and on various texts collected from OPUS and the Leipzig Corpora Collection.

Some statistics:

  • Number of tokens: 57,110,741 (untokenized), 72,683,662 (tokenized)
  • Size: 417M (untokenized), 440M (tokenized)

Remember: Flair Embeddings are trained on raw, untokenized text, so no tokenization is needed; the underlying language model is character-based. ELMo, in contrast, needs tokenized input. For tokenization we use a very simple method adopted from the Tensor2Tensor repository.
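
As a rough illustration of that style of tokenization: the Tensor2Tensor tokenizer splits a string at transitions between alphanumeric and non-alphanumeric characters. The following minimal sketch mimics that behaviour and is not the exact script we used:

import unicodedata

def _is_alnum(ch):
    # letters and digits count as "word" characters, everything else does not
    return unicodedata.category(ch)[0] in ("L", "N")

def simple_tokenize(text):
    # split at transitions between alphanumeric and non-alphanumeric runs,
    # dropping tokens that are a single space (similar in spirit to the
    # Tensor2Tensor tokenizer, not a verbatim copy)
    tokens, start = [], 0
    for i in range(1, len(text) + 1):
        if i == len(text) or _is_alnum(text[i]) != _is_alnum(text[i - 1]):
            token = text[start:i]
            if token != " ":
                tokens.append(token)
            start = i
    return tokens

print(simple_tokenize("Kaixo, mundua!"))  # ['Kaixo', ', ', 'mundua', '!']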

ELMo

We use the official implementation from the bilm-tf repository. Due to limited hardware resources, we limit the vocabulary to 700,000 tokens. We train for 10 epochs on a GTX 1080.
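
For reference, bilm-tf expects the vocabulary as a plain-text file with the special tokens <S>, </S> and <UNK> on the first three lines, followed by tokens in descending frequency. A minimal sketch for building such a file, capped at 700,000 tokens (file names are placeholders):

from collections import Counter

# count token frequencies in the tokenized training corpus (placeholder file name)
counter = Counter()
with open("corpus.tokenized.txt", encoding="utf-8") as f:
    for line in f:
        counter.update(line.split())

# write the bilm-tf vocabulary file: special tokens first, then the
# 700,000 most frequent tokens
with open("vocab-700k.txt", "w", encoding="utf-8") as out:
    out.write("<S>\n</S>\n<UNK>\n")
    for token, _ in counter.most_common(700000):
        out.write(token + "\n")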

Release:

Flair import

The trained ELMo model can easily be used in Flair:

from flair.embeddings import ELMoEmbeddings

embeddings = ELMoEmbeddings(options_file="https://schweter.eu/cloud/eu-elmo/options.json", 
                            weight_file="https://schweter.eu/cloud/eu-elmo/weights.hdf5")
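
As a quick sanity check, the loaded embeddings can be applied to a pre-tokenized sentence (the example sentence is just a placeholder):

from flair.data import Sentence

sentence = Sentence("Kaixo mundua !")  # placeholder sentence, already tokenized
embeddings.embed(sentence)

for token in sentence:
    print(token.text, token.embedding.shape)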

Flair Embeddings

We follow the official recommendations for training Flair Embeddings from the Flair documentation.

The following parameters are used:

| Parameter       | Value |
| --------------- | ----- |
| hidden_size     | 2048  |
| dropout         | 0.1   |
| nlayers         | 1     |
| sequence_length | 250   |
| mini_batch_size | 100   |
| max_epochs      | 10    |
| learning_rate   | 20    |

We did not decrease the initial learning rate during training.
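
For reference, a training run with these parameters looks roughly like the following, mirroring the language model training example from the Flair documentation (the corpus path and output directory are placeholders):

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# character dictionary shipped with Flair
dictionary = Dictionary.load("chars")

# forward character-level language model over the Basque corpus (placeholder path)
is_forward_lm = True
corpus = TextCorpus("/path/to/eu-corpus", dictionary, is_forward_lm, character_level=True)

language_model = LanguageModel(dictionary, is_forward_lm,
                               hidden_size=2048, nlayers=1, dropout=0.1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train("resources/language_models/eu-forward",
              sequence_length=250,
              mini_batch_size=100,
              max_epochs=10,
              learning_rate=20)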

Release:

Flair import

from flair.embeddings import FlairEmbeddings

embeddings_forward  = FlairEmbeddings("lm-eu-opus-large-forward-v0.2.pt")
embeddings_backward = FlairEmbeddings("lm-eu-opus-large-backward-v0.2.pt")

Notice: Our trained embeddings are included in Flair >= 0.4.3, so you can load them directly with:

from flair.embeddings import FlairEmbeddings

embeddings_forward  = FlairEmbeddings("eu-forward")
embeddings_backward = FlairEmbeddings("eu-backward")

NER

We use the Basque Named Entities Corpus (EIEC), which can be obtained from here. The corpus has a total of 2,552 training and 842 test sentences. For evaluation, the official CoNLL-2003 evaluation script is used. We report the F-Score averaged over three runs.
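
A NER tagger with the Basque Flair embeddings can be trained roughly as follows; the corpus path, column layout and tagger hyper-parameters are assumptions, not the exact configuration behind the numbers below:

from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# EIEC converted to a CoNLL-style column format (path and columns are placeholders)
columns = {0: "text", 1: "ner"}
corpus = ColumnCorpus("/path/to/eiec", columns, train_file="train.txt", test_file="test.txt")
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

# stack forward and backward Basque Flair embeddings
embeddings = StackedEmbeddings([
    FlairEmbeddings("eu-forward"),
    FlairEmbeddings("eu-backward"),
])

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner")

trainer = ModelTrainer(tagger, corpus)
trainer.train("resources/taggers/eu-ner", max_epochs=150)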

| Language model   | Run 1 | Run 2 | Run 3 | Final F-Score |
| ---------------- | ----- | ----- | ----- | ------------- |
| ELMo             | 81.50 | 83.13 | 81.41 | 82.01         |
| Flair Embeddings | 81.62 | 81.56 | 81.51 | 81.56         |

UD

We use the Basque Universal Dependencies treebank in version 1.2 for comparison. The corpus has a total of 5,396 training, 1,798 development and 1,799 test sentences. We report accuracy averaged over three runs.
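
The PoS tagging setup follows the same pattern as the NER sketch above, only with a UD corpus and the "upos" tag type. Note that Flair's built-in UD_BASQUE loader (if available in your Flair version) may download a newer UD release than the 1.2 data used here:

from flair.datasets import UD_BASQUE

# Universal Dependencies corpus for Basque; the downloaded release may differ from UD 1.2
corpus = UD_BASQUE()
tag_dictionary = corpus.make_tag_dictionary(tag_type="upos")
# build and train a SequenceTagger with tag_type="upos" exactly as in the NER sketch above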

| Language model   | Run 1 | Run 2 | Run 3 | Final Accuracy |
| ---------------- | ----- | ----- | ----- | -------------- |
| ELMo             | 97.35 | 97.33 | 97.38 | 97.35          |
| Flair Embeddings | 97.60 | 97.67 | 97.67 | 97.65          |
| mBERT uncased    | 95.06 | 94.62 | 94.70 | 94.79          |
| mBERT cased      | 94.26 | 94.43 | 94.33 | 94.35          |

WikiANN

Experiments on the WikiANN dataset for Basque are coming soon.

Tamil

Corpus

Flair Embeddings and ELMo are trained on a recent Wikipedia dump and on various texts collected from OPUS and the Leipzig Corpora Collection.

Some statistics:

  • Number of tokens: 18,365,106 (untokenized), 21,581,878 (tokenized)
  • Size: 423M (untokenized), 426M (tokenized)

ELMo

We use the official implementation from the bilm-tf repository. Due to limited hardware resources, we limit the vocabulary to 700,000 tokens. We train for 10 epochs on a GTX 1080.

Release:

Flair import

The trained ELMo model can easily be used in Flair:

from flair.embeddings import ELMoEmbeddings

embeddings = ELMoEmbeddings(options_file="https://schweter.eu/cloud/ta-elmo/options.json",
                            weight_file="https://schweter.eu/cloud/ta-elmo/weights.hdf5")

Flair Embeddings

We follow the official recommendations for training Flair Embeddings from the Flair documentation.

The following parameters are used:

| Parameter       | Value |
| --------------- | ----- |
| hidden_size     | 2048  |
| dropout         | 0.1   |
| nlayers         | 1     |
| sequence_length | 250   |
| mini_batch_size | 100   |
| max_epochs      | 10    |
| learning_rate   | 20    |

We did not decrease the initial learning rate during training.

Release:

Flair import

from flair.embeddings import FlairEmbeddings

embeddings_forward  = FlairEmbeddings("lm-ta-opus-large-forward-v0.1.pt")
embeddings_backward = FlairEmbeddings("lm-ta-opus-large-backward-v0.1.pt")

Notice: Our trained embeddings are included in Flair >= 0.4.3, so you can load them directly with:

from flair.embeddings import FlairEmbeddings

embeddings_forward  = FlairEmbeddings("ta-forward")
embeddings_backward = FlairEmbeddings("ta-backward")

UD

We use the Tamil Universal Dependencies treebank in version 1.2 for comparison. The corpus has a total of 400 training, 80 development and 120 test sentences. We report accuracy averaged over three runs. In addition, we use subword (BPE) embeddings with different vocabulary sizes and a fixed dimension of 300 in combination with both the Flair and ELMo models.
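
The "BPE vocab" column suggests BPEmb-style byte-pair embeddings; a minimal sketch of combining them with the Tamil Flair embeddings, assuming the BytePairEmbeddings constructor from Flair 0.4.x (the exact arguments may differ in newer releases), for one of the vocabulary sizes from the tables below:

from flair.embeddings import BytePairEmbeddings, FlairEmbeddings, StackedEmbeddings

# 300-dimensional Tamil byte-pair embeddings with a 50,000 BPE vocabulary
# (constructor arguments follow the Flair 0.4.x signature and are an assumption here)
bpe_embeddings = BytePairEmbeddings("ta", dim=300, syllables=50000)

# combine the subword embeddings with the Tamil Flair embeddings
embeddings = StackedEmbeddings([
    bpe_embeddings,
    FlairEmbeddings("ta-forward"),
    FlairEmbeddings("ta-backward"),
])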

Flair

| BPE vocab | Run 1 | Run 2 | Run 3 | Final Accuracy |
| --------- | ----- | ----- | ----- | -------------- |
| 200,000   | 92.31 | 91.55 | 92.46 | 92.11          |
| 100,000   | 92.06 | 92.51 | 92.51 | 92.36          |
| 50,000    | 92.51 | 92.61 | 93.11 | 92.74          |
| 25,000    | 92.61 | 92.06 | 92.81 | 92.49          |
| 10,000    | 91.86 | 92.31 | 91.30 | 91.82          |
| 5,000     | 92.06 | 92.56 | 92.51 | 92.37          |
| 3,000     | 92.31 | 92.86 | 92.76 | 92.64          |
| 1,000     | 92.41 | 92.36 | 93.31 | 92.69          |

ELMo

| BPE vocab | Run 1 | Run 2 | Run 3 | Final Accuracy |
| --------- | ----- | ----- | ----- | -------------- |
| 200,000   | 91.91 | 91.45 | 92.76 | 92.04          |
| 100,000   | 91.96 | 92.01 | 92.16 | 92.04          |
| 50,000    | 91.96 | 92.46 | 91.75 | 92.06          |
| 25,000    | 92.26 | 90.90 | 92.11 | 91.76          |
| 10,000    | 91.91 | 91.50 | 91.65 | 91.69          |
| 5,000     | 92.36 | 91.55 | 91.91 | 91.94          |
| 3,000     | 92.06 | 91.96 | 92.06 | 92.03          |
| 1,000     | 92.06 | 91.80 | 91.70 | 91.85          |

ToDo

  • WikiANN experiments
  • Run NER and PoS tagging experiments on (already) trained XLNet models
  • Add training scripts
  • Play around with allennlp to add configuration for training NER and PoS tagging models
