RoBERTa is an improved recipe for training BERT models that can match or exceed the performance of all of the post-BERT methods. The differences between RoBERTa and BERT are:
- Training the model longer, with bigger batches, over more data.
- Removing the next sentence prediction objective.
- Training on longer sequences.
- Dynamically changing the masking pattern applied to the training data (a toy sketch follows this list).
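For intuition, the sketch below shows the dynamic-masking idea on a plain list of token ids. It ignores BERT's 80/10/10 replacement rule, and the mask_id value is a placeholder, so treat it as an illustration rather than the actual training code.

import random

def dynamic_mask(token_ids, mask_id, mask_prob=0.15):
    # Re-sample the masked positions every time a sequence is seen,
    # instead of fixing them once during preprocessing (static masking).
    return [mask_id if random.random() < mask_prob else t for t in token_ids]

# The same sentence gets a different mask on every pass over the data.
print(dynamic_mask([705, 131, 8751, 2878, 347, 477], mask_id=4))
print(dynamic_mask([705, 131, 8751, 2878, 347, 477], mask_id=4))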
The model was trained on a Vietnamese corpus crawled from many online newspapers: 50GB of text with approximately 7.7 billion words, collected from many domains on the internet including news, law, entertainment, Wikipedia, and so on. The data was cleaned with the visen library and tokenized with SentencePiece. For the envibert model, another 50GB of English text was added, so a total of 100GB of text was used to train the envibert model.
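As a rough sketch of the tokenizer step, training a SentencePiece BPE model with the standard sentencepiece Python API could look like the following; the corpus path and vocabulary size are placeholders, not the exact envibert settings.

import sentencepiece as spm

# Train a BPE model on the cleaned corpus (one sentence per line).
# The input path and vocab_size are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input='corpus_cleaned.txt',
    model_prefix='sentencepiece.bpe',
    model_type='bpe',
    vocab_size=50000)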
- Download the model using the following links: envibert model, cased model, uncased model, and put it in the model-bin folder with the following structure:
model-bin
├── envibert
│   ├── dict.txt
│   ├── model.pt
│   └── sentencepiece.bpe.model
├── uncased
│   ├── dict.txt
│   ├── model.pt
│   └── sentencepiece.bpe.model
└── cased
    ├── dict.txt
    ├── model.pt
    └── sentencepiece.bpe.model
- Install the required libraries:
pip install -r requirements.txt
Load the envibert model with Hugging Face
from transformers import RobertaModel
from transformers.file_utils import cached_path, hf_bucket_url
from importlib.machinery import SourceFileLoader
import os

cache_dir = './cache'
model_name = 'nguyenvulebinh/envibert'

def download_tokenizer_files():
    # Fetch the custom tokenizer files from the Hugging Face hub
    # into cache_dir if they are not already there.
    resources = ['envibert_tokenizer.py', 'dict.txt', 'sentencepiece.bpe.model']
    for item in resources:
        if not os.path.exists(os.path.join(cache_dir, item)):
            tmp_file = hf_bucket_url(model_name, filename=item)
            tmp_file = cached_path(tmp_file, cache_dir=cache_dir)
            os.rename(tmp_file, os.path.join(cache_dir, item))

download_tokenizer_files()
tokenizer = SourceFileLoader(
    "envibert.tokenizer",
    os.path.join(cache_dir, 'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)
model = RobertaModel.from_pretrained(model_name, cache_dir=cache_dir)
# Encode text
text_input = 'Đại học Bách Khoa Hà Nội .'
text_ids = tokenizer(text_input, return_tensors='pt').input_ids
# tensor([[ 0, 705, 131, 8751, 2878, 347, 477, 5, 2]])
# Extract features
text_features = model(text_ids)
text_features['last_hidden_state'].shape
# torch.Size([1, 9, 768])
len(text_features['hidden_states'])
# 7
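If you need a single vector per sentence, a common generic recipe (not something this repository prescribes) is to mean-pool the last hidden state:

import torch

with torch.no_grad():
    outputs = model(text_ids)
# Average the token embeddings into one sentence vector.
sentence_embedding = outputs['last_hidden_state'].mean(dim=1)
sentence_embedding.shape
# torch.Size([1, 768])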
Load the envibert model with fairseq

from fairseq.models.roberta import XLMRModel

# Use the envibert (case-sensitive) model
pretrained_path = './model-bin/envibert/'

# Load the RoBERTa model; this also loads the SentencePiece model.
roberta = XLMRModel.from_pretrained(pretrained_path, checkpoint_file='model.pt')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
text_input = 'Đại học Bách Khoa Hà Nội.'
# Encode using the roberta class
tokens_ids = roberta.encode(text_input)
# assert tokens_ids.tolist() == [0, 451, 71, 3401, 1384, 168, 234, 5, 2]
# Extract features using the roberta model
tokens_embed = roberta.extract_features(tokens_ids)
# assert tokens_embed.shape == (1, 9, 512)
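As a hypothetical next step (not part of this repository's instructions), fairseq's RoBERTa interface can attach a randomly initialized classification head for fine-tuning; the head name and class count below are placeholders.

# Attach a classification head for a hypothetical 2-class task.
roberta.register_classification_head('my_task', num_classes=2)
scores = roberta.predict('my_task', tokens_ids)
# scores has shape (1, 2); the head is untrained, so fine-tune before use.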
RoBERTa can be used to fill <mask> tokens in the input.
masked_line = 'Đại học <mask> Khoa Hà Nội'
roberta.fill_mask(masked_line, topk=5)
#('Đại học Bách Khoa Hà Nội', 0.9954977035522461, ' Bách'),
#('Đại học Y Khoa Hà Nội', 0.001166337518952787, ' Y'),
#('Đại học Đa Khoa Hà Nội', 0.0005696234875358641, ' Đa'),
#('Đại học Văn Khoa Hà Nội', 0.000467598409159109, ' Văn'),
#('Đại học Anh Khoa Hà Nội', 0.00035955727798864245, ' Anh')
This model is a custom version of RoBERTa with fewer hidden layers (6 layers). Three versions are available: envibert (case-sensitive vocabulary covering both languages), cased (case-sensitive vocabulary), and uncased (all text lowercased).
To train this model, please follow the instructions in this repository.
@inproceedings{nguyen20d_interspeech,
  author={Thai Binh Nguyen and Quang Minh Nguyen and Thi Thu Hien Nguyen and Quoc Truong Do and Chi Mai Luong},
  title={{Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={4263--4267},
  doi={10.21437/Interspeech.2020-1896}
}
Please CITE our repo when it is used to help produce published results or is incorporated into other software.