GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning.

Home Page: https://deepparse.org/

Retrain existing model with only new data

onnkeat opened this issue

Let's say I trained a model with 3 million addresses this year. If I get another 1 million addresses next year, can I train the model with only the new 1 million addresses?

Or do I need to retrain the model on the whole dataset of (3 million + 1 million) addresses?

Thank you for your interest in improving Deepparse.

Hello @onnkeat,

That's a good question. Both approaches may be viable depending on your use case and your data.

If the new 1 million addresses are different from the first 3 million and you want your model to generalize well to both types, it might be a good idea to retrain on the whole dataset, assuming your resources (compute, time, etc.) allow you to do so. If you can't afford to retrain on everything, you can fine-tune the model on the new data, but be aware of the possibility of catastrophic forgetting. In that case, I would mix some of the first 3 million addresses in with the new ones to be safe, and re-evaluate on the first dataset afterward.
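
As a rough illustration of that safer path, here is a minimal sketch using deepparse's `AddressParser.retrain` and `PickleDatasetContainer`; the file names, the 10% sampling ratio, and the hyperparameters are placeholders, and the data is assumed to be stored as pickled lists of `(address, tag_list)` tuples.

```python
import pickle
import random

from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# Hypothetical file names: pickled lists of (address, tag_list) tuples.
with open("new_addresses.p", "rb") as f:
    new_data = pickle.load(f)
with open("old_addresses.p", "rb") as f:
    old_data = pickle.load(f)

# Mix a sample of the original addresses in with the new ones to limit
# catastrophic forgetting (the 10% ratio is an arbitrary placeholder).
random.seed(42)
mixed_data = new_data + random.sample(old_data, k=len(old_data) // 10)
random.shuffle(mixed_data)

with open("mixed_addresses.p", "wb") as f:
    pickle.dump(mixed_data, f)

# Load a pre-trained parser and fine-tune it on the mixed dataset.
address_parser = AddressParser(model_type="bpemb", device=0)
address_parser.retrain(
    PickleDatasetContainer("mixed_addresses.p"),
    train_ratio=0.8,
    epochs=5,
    batch_size=32,
    logging_path="./checkpoints",
)

# Re-evaluate on a held-out split of the original data to check for forgetting.
results = address_parser.test(PickleDatasetContainer("old_test_addresses.p"))
print(results)
```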

If the new addresses are similar to the previous ones, or if you don't care as much about performance on the previous addresses (which is probably not the case), you could get away with simply fine-tuning your pre-trained model.
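
That simpler route is just the same `retrain` call pointed at the new data only, something along these lines (again, the file name and hyperparameters are placeholders):

```python
from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# Fine-tune the pre-trained model on the new addresses only.
address_parser = AddressParser(model_type="bpemb", device=0)
address_parser.retrain(
    PickleDatasetContainer("new_addresses.p"),
    train_ratio=0.8,
    epochs=5,
    batch_size=32,
)
```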

Also, our models have been shown to do well in a zero-shot setting, so you can test their performance on your new addresses before going ahead with either a retraining or a fine-tuning.
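
If your new addresses are already tagged, a quick way to check that zero-shot performance is to run the pre-trained model through `test` before deciding; the sketch below assumes a hypothetical pickled dataset of `(address, tag_list)` tuples.

```python
from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# Evaluate the pre-trained model as-is on the tagged new addresses.
address_parser = AddressParser(model_type="bpemb", device=0)
zero_shot_results = address_parser.test(PickleDatasetContainer("new_addresses.p"))
print(zero_shot_results)
```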

@onnkeat, we plan to add documentation this summer with our recommendations for fine-tuning and training Deepparse.

Thanks! It would be very helpful 😀