Advice about training with additional synthetic dataset

Question

rachelwrr opened this issue 3 years ago · comments

Hi,

Thanks for the work!

Just looking for advice. If I want to add a synthetic dataset that targets a few specific grammar errors, in what order would you recommend training the model? Will changing the order of the 3 training stages affect the result?

Fine-tune on top of your pretrained model (after Stage 3)?
Or
Restart the training process and include the new dataset in Stage 1?

I'm new to this area. Any advice would be appreciated :)

Thanks!

Alex Skurzhanskyi · Answer 1 · Wed Jul 07 2021 22:44:19 GMT+0800 (China Standard Time)

Hi
I think this depends on how much your errors differ from those in the dataset. In general, I would suggest adding these errors to Stage 1 and then applying Stage 2 & 3, as your data is synthetic.
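As a rough illustration of what "adding these errors to Stage 1" could look like on the data side, here is a minimal sketch that mixes new synthetic sentence pairs into an existing set of Stage 1 pairs and shuffles them together. The function name `merge_parallel` and the example sentences are hypothetical and are not part of this repository; the merged pairs would still need to go through the usual preprocessing before training.

```python
import random


def merge_parallel(base_pairs, extra_pairs, seed=0):
    """Combine two lists of (source, target) pairs and shuffle them together.

    Each pair is kept intact, so a corrupted source sentence always stays
    aligned with its corrected target sentence.
    """
    merged = list(base_pairs) + list(extra_pairs)
    # A seeded Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(merged)
    return merged


# Hypothetical example data: (incorrect sentence, corrected sentence).
stage1_pairs = [
    ("He go home .", "He goes home ."),
    ("She like tea .", "She likes tea ."),
]
new_pairs = [
    ("He runs quick .", "He runs quickly ."),
]

mixed = merge_parallel(stage1_pairs, new_pairs)
```

Shuffling the two sources together, rather than appending the new data at the end, is one way to avoid the model seeing all of the new error type in a single block during Stage 1.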

rachelwrr · Answer 2 · Thu Jul 08 2021 11:13:02 GMT+0800 (China Standard Time)

Thanks for the reply! For the dataset, I took 60,000 sentences from the PIE folder a5 (true), then converted adjectives to adverbs, intending to improve correction of adjective/adverb conversion errors.
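To make the idea concrete, here is a minimal sketch of that kind of corruption, assuming whitespace-tokenized sentences and a small hand-written adjective-to-adverb lookup. The `ADJ_TO_ADV` table and the `corrupt` and `build_pairs` helpers are hypothetical names for illustration only; a real pipeline would more likely rely on a POS tagger and a larger lexicon.

```python
# Hypothetical lookup table; a real one would be much larger.
ADJ_TO_ADV = {"quick": "quickly", "careful": "carefully", "slow": "slowly"}


def corrupt(sentence, mapping=ADJ_TO_ADV):
    """Replace the first known adjective with its adverb form.

    Returns the corrupted sentence, or None if no adjective from the
    mapping occurs in the sentence.
    """
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        adv = mapping.get(tok.lower())
        if adv is not None:
            # Preserve sentence-initial capitalization.
            tokens[i] = adv.capitalize() if tok[0].isupper() else adv
            return " ".join(tokens)
    return None


def build_pairs(sentences):
    """Build (incorrect, correct) pairs, skipping sentences left unchanged."""
    pairs = []
    for sent in sentences:
        corrupted = corrupt(sent)
        if corrupted is not None:
            pairs.append((corrupted, sent))
    return pairs


clean = ["She is a careful driver .", "The train is slow .", "I like tea ."]
pairs = build_pairs(clean)
```

Sentences without a known adjective are dropped here, so the resulting set contains only the targeted error type.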