manhlab / bert-vietnamese-base

BERT Vietnamese version TF1.14.0


Pretrained Vietnamese BERT models

This repository provides pretrained Vietnamese BERT models, together with the source code used for pretraining.


  • All the models are trained on Vietnamese Wikipedia.

  • All the models are trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

  • We also distribute models trained with Whole Word Masking enabled; all of the tokens corresponding to a word (tokenized by Underthesea) are masked at once.

  • Along with the models, we provide tokenizers, which are compatible with ones defined in Transformers by Hugging Face.

  • We use the Underthesea library for sentence processing.

  • v2 Update VietNameseTokenierNormalize
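Whole Word Masking, mentioned in the list above, can be illustrated with a minimal sketch. It assumes that continuation pieces carry a "##" prefix (the convention of the original BERT; the pieces produced by this repository's tokenizer may be marked differently) and it always substitutes [MASK], whereas real pretraining data also keeps or randomizes some of the selected tokens. The function names and the example pieces are hypothetical.

```python
import random


def group_into_words(tokens):
    """Group token indices into words; a '##' prefix marks a continuation piece."""
    words = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and words:
            words[-1].append(index)
        else:
            words.append([index])
    return words


def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Mask whole words: every piece of a selected word becomes [MASK]."""
    rng = random.Random(seed)
    words = group_into_words(tokens)
    rng.shuffle(words)
    # Upper bound on the number of masked pieces for this sequence.
    budget = max(1, round(len(tokens) * mask_prob))
    masked = list(tokens)
    covered = 0
    for word in words:
        if covered + len(word) > budget:
            continue  # masking this word would exceed the budget
        for index in word:
            masked[index] = "[MASK]"
        covered += len(word)
    return masked


# Hypothetical pieces for "Hà_Nội là thủ_đô của Việt_Nam"
pieces = ["Hà", "##_Nội", "là", "thủ", "##_đô", "của", "Việt", "##_Nam"]
print(whole_word_mask(pieces, mask_prob=0.5, seed=1))
```

Because selection happens per word rather than per piece, a word is never left half masked, which is the property that distinguishes this scheme from masking subword tokens independently.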

Pretrained models

  • BERT-base models (12-layer, 768-hidden, 12-heads, 110M parameters)

All the model archives include the following files. pytorch_model.bin and tf_model.h5 are compatible with Transformers.

├── config.json
├── model.ckpt.index
├── model.ckpt.meta
├── pytorch_model.bin
├── tf_model.h5
└── vocab.txt
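
After unpacking an archive, a quick check that all of the listed files are present can save a confusing load error later. The helper below is a small sketch using only the standard library; the function name and the example directory are hypothetical.

```python
from pathlib import Path

# File names taken from the archive listing above.
EXPECTED_FILES = (
    "config.json",
    "model.ckpt.index",
    "model.ckpt.meta",
    "pytorch_model.bin",
    "tf_model.h5",
    "vocab.txt",
)


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "./bert-vietnamese-base" is a placeholder for wherever the archive was unpacked.
    print(missing_files("./bert-vietnamese-base"))
```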

At present, only BERT-base models are available. I am planning to release BERT-large models in the future.


For just using the models:

If you wish to pretrain a model:


Please refer to masked_lm_example.ipynb.

Details of pretraining

Corpus generation and preprocessing

All distributed models are pretrained on Vietnamese Wikipedia. To generate the corpus, WikiExtractor is used to extract plain text from a Wikipedia dump file.

$ wget
$ python wikiextractor/ --output /corpus --bytes 512M --compress --json --links --namespaces 0 --no_templates --min_text_length 16 --processes 20 viwiki-20200520-pages-articles-multistream.xml.bz2
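
With --json and --compress, WikiExtractor writes bz2-compressed shards in which each line is one JSON object holding, among other fields, the article "title" and "text". A minimal sketch for reading such a shard follows; the function name and the shard path in the usage line are placeholders.

```python
import bz2
import json


def iter_articles(path):
    """Yield (title, text) pairs from one WikiExtractor --json --compress shard."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            yield record["title"], record["text"]


if __name__ == "__main__":
    # Placeholder path; adjust to the actual output directory.
    for title, text in iter_articles("/corpus/AA/wiki_00.bz2"):
        print(title, len(text))
        break
```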

Install the required libraries:
$ sudo bash

The extracted text is then preprocessed: it is split into sentences, noisy markup is removed, and so on.

$ seq -f %02g 0 3|xargs -L 1 -I {} -P 9 python bert-vietnamese/ --input_file /corpus/AA/wiki_{}.bz2 --output_file /corpus/corpus.txt.{} --vina_dict_path /path/to/neologd/dict/dir/
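
To give a feel for this preprocessing step, here is a rough standard-library sketch. It is a stand-in, not the repository's script: the actual pipeline relies on Underthesea for sentence processing, while this version strips leftover tags (such as the anchor tags kept by --links) and splits on sentence-final punctuation with a regular expression. The function name and the min_length threshold are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")             # leftover markup such as <a href="...">
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")  # whitespace after ., ! or ?


def clean_and_split(text, min_length=10):
    """Strip tags, normalize whitespace, and split each paragraph into sentences."""
    sentences = []
    for paragraph in text.split("\n"):
        paragraph = TAG_RE.sub("", paragraph)
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        for sentence in SENTENCE_END_RE.split(paragraph):
            if len(sentence) >= min_length:  # drop empty or very short fragments
                sentences.append(sentence)
    return sentences


sample = 'Hà Nội là <a href="x">thủ đô</a> của Việt Nam. Thành phố nằm ở miền Bắc.'
for sentence in clean_and_split(sample):
    print(sentence)
```

A regex splitter mishandles abbreviations and decimal numbers, which is one reason a language-aware library is preferable for the real corpus.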

Building vocabulary

Like the original BERT, we split text into subwords. We obtained them with byte-pair encoding (BPE), using the implementation in SentencePiece.

# For vocab models
$ !python bert-vietnamese/ --input_file "/corpus/corpus.txt.*" --output_file "/base/vocab.txt" --subword_type bpe --vocab_size 32000
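
For readers unfamiliar with BPE, the toy example below shows the core idea: repeatedly merge the most frequent adjacent pair of symbols. It is an illustration only and does not reproduce SentencePiece, which adds normalization, whitespace handling, and many optimizations. The function name and the tiny word list are made up for the example.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: count} mapping; return the merges in order."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_vocab[key] = merged_vocab.get(key, 0) + count
        vocab = merged_vocab
    return merges


print(learn_bpe({"anh": 4, "an": 3, "nh": 1}, num_merges=2))
# → [('a', 'n'), ('an', 'h')]
```

In the first round the pair ("a", "n") occurs 7 times against 5 for ("n", "h"), so it is merged first; the second round then merges "an" with "h".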

Creating data for pretraining

With the vocabulary and text files above, we create dataset files for pretraining. Note that this process is highly memory-consuming and takes many hours.

# For 32k w/ whole word masking
# Note: each process will consume about 32GB RAM
$ seq -f %02g 0 8|xargs -L 1 -I {} -P 1 python --input_file /path/to/corpus/dir/corpus.txt.{} --output_file /path/to/base/dir/pretraining-data.tf_record.{} --do_whole_word_mask True --vocab_file /path/to/base/dir/vocab.txt --subword_type bpe --max_seq_length 512 --max_predictions_per_seq 80 --masked_lm_prob 0.15

# Note: each process will consume about 32GB RAM
$ !seq -f %02g 0 8|xargs -L 1 -I {} -P 1 python bert-vietnamese/ --input_file /corpus/corpus.txt.{} --output_file /base/pretraining-data.tf_record.{} --do_whole_word_mask True --vocab_file /base/vocab.txt --subword_type bpe --max_seq_length 512 --max_predictions_per_seq 80 --masked_lm_prob 0.15
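
The three masking flags interact in a simple way. In the original BERT data-generation script, the number of masked positions per sequence is the rounded product of the sequence length and masked_lm_prob, at least 1 and capped at max_predictions_per_seq. Assuming this repository's script keeps that formula (an assumption, not verified here), a full 512-token sequence gets 77 masked positions, just under the cap of 80. The function name below is hypothetical.

```python
def num_masked_positions(seq_length, masked_lm_prob=0.15, max_predictions_per_seq=80):
    """Masked positions per sequence, following the original BERT formula."""
    return min(max_predictions_per_seq, max(1, int(round(seq_length * masked_lm_prob))))


print(num_masked_positions(512))  # → 77
print(num_masked_positions(128))  # → 19
print(num_masked_positions(3))    # → 1
```

This also explains why 80 is a sensible value for --max_predictions_per_seq: it is slightly above 512 × 0.15 = 76.8, so the cap rarely truncates anything.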


We ran pretraining on Cloud TPUs.

For the BERT-base models, we used v3-8 TPUs.

# For BERT-base models
$ python3 \
--input_file="/path/to/pretraining-data.tf_record.*" \
--output_dir="/path/to/output_dir" \
--bert_config_file=bert_base_32k_config.json \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--do_train=True \
--train_batch_size=256 \
--num_train_steps=1000000 \
--learning_rate=1e-4 \
--save_checkpoints_steps=100000 \
--keep_checkpoint_max=10 \
--use_tpu=True \
--tpu_name=<tpu name> \
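
As a back-of-the-envelope check on the flags above, the snippet below works out how many token positions the run processes and how many periodic checkpoints it writes. The numbers come directly from the command; the function name is hypothetical, and the token count includes padding positions, so it is an upper bound on real tokens seen.

```python
def training_budget(batch_size=256, seq_length=512, steps=1_000_000,
                    save_every=100_000, keep_max=10):
    """Summarize the scale of a pretraining run from its command-line flags."""
    tokens_per_step = batch_size * seq_length
    periodic_checkpoints = steps // save_every
    return {
        "tokens_per_step": tokens_per_step,
        "total_token_positions": tokens_per_step * steps,
        "periodic_checkpoints": periodic_checkpoints,
        "checkpoints_kept": min(periodic_checkpoints, keep_max),
    }


for name, value in training_budget().items():
    print(f"{name}: {value:,}")
```

With these settings each step covers 131,072 token positions, the whole run covers about 1.3 × 10^11, and all 10 periodic checkpoints fit within --keep_checkpoint_max.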


  • The models can be used with Transformers:
    • Tokenizer:
    • Model:


The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license.

The code in this repository is distributed under the MIT License.

Related Work


To train the models, we used Cloud TPUs provided by the TensorFlow Research Cloud program. Thanks to the Japanese BERT project!



License:Apache License 2.0

