GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning.

Home Page: https://deepparse.org/

Retrain existing model with only new data

onnkeat opened this issue

Let's say I trained a model with 3 million addresses this year. If I get another 1 million addresses next year, can I train the model with only the new 1 million addresses?

Or do I need to retrain the model on the whole dataset of (3 million + 1 million) addresses?

Thank you for your interest in improving Deepparse.

Hello @onnkeat,

That's a good question. Both approaches may be viable depending on your use case and your data.

If the new 1 million addresses are different from the first 3 million and you want your model to generalize well to both types, it might be a good idea to retrain on the whole dataset, assuming your resources (compute, time, etc.) allow you to do so. If you can't afford to retrain on everything, you can fine-tune the model on the new data, but be aware of the possibility of catastrophic forgetting. In that case, I would mix some of the first 3 million addresses in with the new ones to be safe, and re-evaluate on the first dataset afterward.
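
As a rough illustration of that safer path, here is a minimal sketch using deepparse's `AddressParser.retrain` and `PickleDatasetContainer`; the file names, the 10% sampling ratio, and the hyperparameters are placeholders, and the data is assumed to be stored as pickled lists of `(address, tag_list)` tuples.

```python
import pickle
import random

from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# Hypothetical file names: pickled lists of (address, tag_list) tuples.
with open("new_addresses.p", "rb") as f:
    new_data = pickle.load(f)
with open("old_addresses.p", "rb") as f:
    old_data = pickle.load(f)

# Mix a sample of the original addresses in with the new ones to limit
# catastrophic forgetting (the 10% ratio is an arbitrary placeholder).
random.seed(42)
mixed_data = new_data + random.sample(old_data, k=len(old_data) // 10)
random.shuffle(mixed_data)

with open("mixed_addresses.p", "wb") as f:
    pickle.dump(mixed_data, f)

# Load a pre-trained parser and fine-tune it on the mixed dataset.
address_parser = AddressParser(model_type="bpemb", device=0)
address_parser.retrain(
    PickleDatasetContainer("mixed_addresses.p"),
    train_ratio=0.8,
    epochs=5,
    batch_size=32,
    logging_path="./checkpoints",
)

# Re-evaluate on a held-out split of the original data to check for forgetting.
results = address_parser.test(PickleDatasetContainer("old_test_addresses.p"))
print(results)
```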

If the new addresses are similar to the previous ones, or if you don't care as much about performance on the previous addresses (which is probably not the case), you could get away with simply fine-tuning your pre-trained model.
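
That simpler route is just the same `retrain` call pointed at the new data only, something along these lines (again, the file name and hyperparameters are placeholders):

```python
from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# Fine-tune the pre-trained model on the new addresses only.
address_parser = AddressParser(model_type="bpemb", device=0)
address_parser.retrain(
    PickleDatasetContainer("new_addresses.p"),
    train_ratio=0.8,
    epochs=5,
    batch_size=32,
)
```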

Also, our models have been shown to do well in a zero-shot setting, so you can test their performance on your new addresses before going ahead with either a retraining or a fine-tuning.
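
If your new addresses are already tagged, a quick way to check that zero-shot performance is to run the pre-trained model through `test` before deciding; the sketch below assumes a hypothetical pickled dataset of `(address, tag_list)` tuples.

```python
from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# Evaluate the pre-trained model as-is on the tagged new addresses.
address_parser = AddressParser(model_type="bpemb", device=0)
zero_shot_results = address_parser.test(PickleDatasetContainer("new_addresses.p"))
print(zero_shot_results)
```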

@onnkeat, we plan to add documentation this summer with our recommendations for fine-tuning and training Deepparse.

Thanks! It would be very helpful 😀