[Question] What were the sizes of the English, Nepali, and Hindi data that multilingual BERT cased was trained on?

mani-rai opened this issue 2 years ago

I am writing a thesis that references mBERT heavily, and it would be really helpful to know the sizes of the English, Nepali, and Hindi data used for training. Other papers report only ranges or a total size that includes all of the languages, but I need the figures for just these three. Also, Wikipedia says its English data is 300 GB in size, which I don't think mBERT was trained on. Does anybody know the sizes?