hooshvare / parsbert-ner

🤗 ParsBERT Persian NER Tasks

Home Page: https://huggingface.co/HooshvareLab/bert-base-parsbert-ner-uncased


Multi-lingual BERT

danyaljj opened this issue

Salaam,
Nice work!

Did you have a chance to try multilingual BERT?
The Transformers package has a minimal example for NER: https://github.com/huggingface/transformers/tree/master/examples/token-classification

Hi,
Thank you so much!

Yes, but we didn't include it in our paper. For your record and for others, I have updated the README with mBERT results on the NER datasets.

| Dataset | ParsBERT (F1) | mBERT (F1) |
|---------|---------------|------------|
| PEYMA   | 93.10*        | 86.64      |
| ARMAN   | 98.79*        | 95.89      |

Good! I am glad you have tried mBERT.
And the numbers look reasonable to me.

Several other queries for you:

  1. A while ago I collected a lot of raw Persian text. I suspect it is bigger than your current pretraining corpus. Feel free to use it: https://github.com/persiannlp/persian-raw-text
  2. What model sizes are you releasing?
  3. When are you guys planning to release the models?
  4. It would be great to make your models available on the Huggingface hub: https://huggingface.co/models. That would give your work a lot of visibility.

Thank you again!

  • We plan to upgrade the model shortly (it is already under development and will be published soon). We have also used a much larger corpus for this upgrade!
  • The model size and architecture are based on the BERT-base configuration!
  • In fact, the model has already been released on the Huggingface hub; you can access it from here! (Also, the pipeline shown in the README is implemented with Huggingface — check it out!)
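As a minimal sketch of the point above: since the model is on the Huggingface hub, it can be loaded through the standard `transformers` pipeline API. The model name comes from the repo's home page; the Persian example sentence is an arbitrary choice for illustration, not taken from the README.

```python
# Minimal sketch: tagging a Persian sentence with the released ParsBERT NER
# model via the Huggingface transformers pipeline. Downloads the model on
# first use.
from transformers import pipeline

ner_tagger = pipeline(
    "ner",
    model="HooshvareLab/bert-base-parsbert-ner-uncased",
)

# Arbitrary example sentence: "Iran is a country in the Middle East."
results = ner_tagger("ایران کشوری در خاورمیانه است")

# Each result is a dict with the sub-token, its predicted entity label,
# and the model's confidence score.
for token in results:
    print(token["word"], token["entity"], round(token["score"], 3))
```

The pipeline returns predictions per sub-token; to merge them into whole-entity spans, `pipeline("ner", aggregation_strategy="simple", ...)` can be used instead.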

Wonderful! 🥳