hooshvare / parsbert

🤗 ParsBERT: Transformer-based Model for Persian Language Understanding

Home Page: https://doi.org/10.1007/s11063-021-10528-4


Details of your pre-training

mahdirezaey opened this issue · comments

Hello @m3hrdadfi and others,
Very nice to hear about your work!
Could you please provide some more information about your pre-training?

Regarding your batch size (32), it seems that 1.9M steps is quite small (around 10 epochs).
Did you monitor the loss on downstream tasks during training, and had it mostly stopped changing around step 1.9M?
I think you used 1.9M steps because of computation costs? Am I right?
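
For concreteness, here is the back-of-the-envelope calculation behind my "around 10 epochs" estimate; the corpus size I plug in is only a placeholder, not the real ParsBERT figure:

```python
# Rough sketch: how many epochs do 1.9M optimizer steps correspond to?
# NOTE: num_training_examples is a hypothetical placeholder, not the actual
# ParsBERT corpus size.
train_steps = 1_900_000
batch_size = 32                       # as reported in the article
num_training_examples = 6_000_000     # placeholder: pre-tokenized training instances

examples_seen = train_steps * batch_size                # 60,800,000
epochs = examples_seen / num_training_examples          # ~10.1 with this placeholder
print(f"examples seen: {examples_seen:,}  ->  approx. {epochs:.1f} epochs")
```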

Did you use Google's scripts for pre-training, or Hugging Face's?

Did you use TPUs (or GPUs) for your pre-training? What type?

Did you use distributed training? (If yes, how many TPUs or GPUs in parallel?)
(Again, if yes, was the batch size of 32 announced in the article per GPU/TPU, or did you add them up across devices? A quick sketch of what I mean follows.)
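
Just to be clear about what I mean by "adding them up", here is a quick sketch; the device count and accumulation factor are hypothetical:

```python
# Sketch of the effective (global) batch size under data-parallel training.
# The device count and accumulation factor are hypothetical examples.
per_device_batch = 32
num_devices = 8                  # e.g. the 8 cores of a TPU v2-8
grad_accumulation_steps = 1      # hypothetical

effective_batch = per_device_batch * num_devices * grad_accumulation_steps
print(effective_batch)           # 256 if 32 is the per-core figure
```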

Did you use mixed-precision training (fp16) or gradient accumulation steps?
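
For reference, in the Hugging Face Trainer API these are the knobs I am asking about; this is only an illustrative sketch, not your actual configuration:

```python
# Minimal sketch of where these options live in the Hugging Face Trainer API.
# This is NOT the ParsBERT authors' setup, just an illustration of the questions above.
from transformers import BertConfig, BertForMaskedLM, Trainer, TrainingArguments

config = BertConfig()                      # default BERT-base hyperparameters
model = BertForMaskedLM(config)

args = TrainingArguments(
    output_dir="bert-pretraining",
    per_device_train_batch_size=32,        # batch size *per* GPU/TPU core
    gradient_accumulation_steps=1,         # >1 simulates a larger global batch
    max_steps=1_900_000,                   # step-based rather than epoch-based training
    fp16=True,                             # mixed precision on GPUs
    learning_rate=1e-4,
)

trainer = Trainer(model=model, args=args)
# trainer.train()  # needs a real train_dataset and an MLM data collator
```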

We would be glad to contribute to making ParsBERT better!

Thank you so much for your interest! The whole training process took around 1.9M train steps and about six days of uninterrupted training on a TPU v2-8. And for the record, as you already know from the article, we covered many types of written Persian, so yes, we did consider a broad training distribution. Please check the article!
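
As a rough sanity check, those figures imply roughly the following throughput (back-of-the-envelope only):

```python
# Throughput implied by ~1.9M steps in ~6 days on a TPU v2-8 (rough estimate).
train_steps = 1_900_000
wall_clock_seconds = 6 * 24 * 3600      # roughly six days

steps_per_second = train_steps / wall_clock_seconds
print(f"~{steps_per_second:.2f} optimizer steps per second")   # ~3.67
```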