google-research / bert

TensorFlow code and pre-trained models for BERT

Home Page: https://arxiv.org/abs/1810.04805

[Question] What were the sizes of the English, Nepali, and Hindi data that multilingual BERT cased was trained on?

mani-rai opened this issue · comments

I am writing a thesis that references mBERT a lot, and it would be really helpful to know the sizes of the English, Nepali, and Hindi data used for training. Other papers report either ranges or a total size that covers all of the languages, but I only need the figures for these three. Also, Wikipedia mentions that its English data is 300 GB in size, which I don't think mBERT was trained on. Does anybody know the sizes?