hooshvare / parsbert

🤗 ParsBERT: Transformer-based Model for Persian Language Understanding

Home Page: https://doi.org/10.1007/s11063-021-10528-4


Details of your pre-training

mahdirezaey opened this issue · comments

Hello @m3hrdadfi and others,
Very nice to hear about your work!
Could you please provide some more information about your pre-training?

Regarding your batch size (32), it seems that 1.9M steps is quite small (around 10 epochs).
Did you monitor the loss on downstream tasks during training, and had it mostly stopped changing around step 1.9M?
I think you used 1.9M steps because of computation costs? Am I right?
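
For concreteness, here is the back-of-the-envelope calculation behind my "around 10 epochs" estimate; the corpus size I plug in is only a placeholder, not the real ParsBERT figure:

```python
# Rough sketch: how many epochs do 1.9M optimizer steps correspond to?
# NOTE: num_training_examples is a hypothetical placeholder, not the actual
# ParsBERT corpus size.
train_steps = 1_900_000
batch_size = 32                       # as reported in the article
num_training_examples = 6_000_000     # placeholder: pre-tokenized training instances

examples_seen = train_steps * batch_size                # 60,800,000
epochs = examples_seen / num_training_examples          # ~10.1 with this placeholder
print(f"examples seen: {examples_seen:,}  ->  approx. {epochs:.1f} epochs")
```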

Did you use Google's scripts for pre-training, or Hugging Face's?

Did you use TPUs (or GPUs) for your pre-training? What type?

Did you use distributed training? (If yes, how many TPUs or GPUs in parallel?)
(Again, if yes, was the batch size of 32 announced in the article per GPU/TPU, or did you add them up across devices? A quick sketch of what I mean follows.)
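
Just to be clear about what I mean by "adding them up", here is a quick sketch; the device count and accumulation factor are hypothetical:

```python
# Sketch of the effective (global) batch size under data-parallel training.
# The device count and accumulation factor are hypothetical examples.
per_device_batch = 32
num_devices = 8                  # e.g. the 8 cores of a TPU v2-8
grad_accumulation_steps = 1      # hypothetical

effective_batch = per_device_batch * num_devices * grad_accumulation_steps
print(effective_batch)           # 256 if 32 is the per-core figure
```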

Did you use mixed-precision training (fp16) or gradient accumulation steps?
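
For reference, in the Hugging Face Trainer API these are the knobs I am asking about; this is only an illustrative sketch, not your actual configuration:

```python
# Minimal sketch of where these options live in the Hugging Face Trainer API.
# This is NOT the ParsBERT authors' setup, just an illustration of the questions above.
from transformers import BertConfig, BertForMaskedLM, Trainer, TrainingArguments

config = BertConfig()                      # default BERT-base hyperparameters
model = BertForMaskedLM(config)

args = TrainingArguments(
    output_dir="bert-pretraining",
    per_device_train_batch_size=32,        # batch size *per* GPU/TPU core
    gradient_accumulation_steps=1,         # >1 simulates a larger global batch
    max_steps=1_900_000,                   # step-based rather than epoch-based training
    fp16=True,                             # mixed precision on GPUs
    learning_rate=1e-4,
)

trainer = Trainer(model=model, args=args)
# trainer.train()  # needs a real train_dataset and an MLM data collator
```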

We would be glad to contribute to making ParsBERT better!

Thank you so much for your interest! The whole training process took around 1.9M train steps and about six days of uninterrupted training on a TPU v2-8. And for the record, as you already know from the article, we covered many types of written Persian, so yes, we did consider a broad training distribution. Please check the article!
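
As a rough sanity check, those figures imply roughly the following throughput (back-of-the-envelope only):

```python
# Throughput implied by ~1.9M steps in ~6 days on a TPU v2-8 (rough estimate).
train_steps = 1_900_000
wall_clock_seconds = 6 * 24 * 3600      # roughly six days

steps_per_second = train_steps / wall_clock_seconds
print(f"~{steps_per_second:.2f} optimizer steps per second")   # ~3.67
```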