GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/

Questions about data and retraining in general

devTS123 opened this issue

Hello,

Thank you very much for this remarkable project. It is amazing how well addresses of different forms and shapes are recognized.
The documentation says that exactly 100,000 addresses per country were used for training, and significantly more for testing.

I would be interested to know how the 100,000 addresses have been selected. Is this done randomly or is there a scheme (zip codes, structure, ...) behind it? It is surely quite a challenge to reach an accuracy of 99%+ with such a limited number of addresses.

I am generally interested in such topics from a technical point of view and would like to try training new tags with my own data.
Would this also require 100,000 addresses to get a comparable accuracy rate?
Would the accuracy rate get significantly worse if you only take <10,000 addresses?
Would the accuracy rate be significantly better if you took even >1,000,000 addresses? That is, better in the case of retraining...

How much data do you think it would take to learn a particular address format pattern that is not currently in the data set?

In the documentation the following parameters are given as an example.
Are these also the parameters with which deepparse was trained? Or what are suitable parameters?

address_parser.retrain(training_container,
                       train_ratio=0.8,
                       epochs=5,
                       batch_size=8,
                       num_workers=2,
                       callbacks=[lr_scheduler],
                       prediction_tags=tag_dictionary,
                       logging_path=logging_path)
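
On my side, I would set up the objects used in that example roughly like this (the dataset path and tag names are placeholders, and the learning rate scheduler callback comes from poutyne, the training library deepparse builds on):

import poutyne

from deepparse.dataset_container import PickleDatasetContainer

# Pickled list of (address, list of tags) samples; the path is a placeholder.
training_container = PickleDatasetContainer("path/to/training_dataset.p")

# Learning rate scheduler callback (decays the learning rate by a factor of 0.1 each epoch).
lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)

# Custom tag set; according to the docs, the dictionary must include an EOS tag.
tag_dictionary = {"StreetNumber": 0, "StreetName": 1, "Municipality": 2, "EOS": 3}

# Directory where the retraining checkpoints will be written.
logging_path = "checkpoints"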

Would I get the same accuracy rate (99%+) with the same training data set and parameters?
And how long do you think training on a GPU would take, say, for a country with 100,000 addresses?

And one last question. When I retrain the model on machine A, I get a checkpoint file (ckpt).
Can I copy that file afterwards, leave it unchanged and use it on computer B without re-training on that machine?
Is it enough to copy only this file or are other files needed?

Thank you very much in advance for your responses!

Thank you for your interest in improving Deepparse.

Hello @devTS123,

I'm happy to learn that you are enjoying Deepparse. Here are some answers to your questions:

I would be interested to know how the 100,000 addresses have been selected. Is this done randomly or is there a scheme (zip codes, structure, ...) behind it?

The addresses for each country were actually chosen randomly.

Would this also require 100,000 addresses to get a comparable accuracy rate?

If you are training on addresses from a single country, it probably won't require as much data to get a good accuracy, depending on the complexity of said addresses.

Would the accuracy rate get significantly worse if you only take <10,000 addresses?

This can only be evaluated by running a training and testing the resulting model's performance. However, starting from our models gives you an advantage and will probably lead to better performance after retraining, since they have already been pretrained on many addresses.

Would the accuracy rate be significantly better if you took even >1,000,000 addresses? That is, better in the case of retraining...

The performance of deep learning models (which are the ones used in Deepparse) usually improves with the amount of data, so it's generally a good idea to use as much data as possible. However, some of our models (mainly the bpemb model) are quite complex and would require a lot of training time for that many addresses. In this case, I suggest running multiple trainings while increasing the amount of training addresses (10,000, 50,000, 100,000, ...) until you hit a plateau, unless training time is not much of a concern.
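
As a rough sketch of what I mean (the file names and subset sizes below are placeholders, and I'm assuming the training data is available as a pickled list of tagged samples, as in the retrain example above):

import pickle

from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

with open("all_training_samples.p", "rb") as file:  # placeholder path
    all_samples = pickle.load(file)

test_container = PickleDatasetContainer("test_samples.p")  # placeholder path, held-out test set

for size in (10_000, 50_000, 100_000):
    # Dump an increasingly large subset of the training data to its own file.
    subset_path = f"train_{size}.p"
    with open(subset_path, "wb") as file:
        pickle.dump(all_samples[:size], file)

    # Start from the pretrained fasttext model each time and retrain on the subset.
    address_parser = AddressParser(model_type="fasttext", device=0)
    address_parser.retrain(PickleDatasetContainer(subset_path),
                           train_ratio=0.8,
                           epochs=5,
                           batch_size=8,
                           logging_path=f"checkpoints_{size}")

    # Evaluate on the same test set every time to see when performance plateaus.
    print(size, address_parser.test(test_container, batch_size=8))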

How much data do you think it would take to learn a particular address format pattern that is not currently in the data set?

I would say a few thousand addresses at the very least.

Are these also the parameters with which deepparse was trained? Or what are suitable parameters?

Not all of them. If you are interested in more information about our training process and parameters, you can take a look at our paper.

Would I get the same accuracy rate (99%+) with the same training data set and parameters?

If the original parameters from the paper are used together with our training data, you should get similar results.

And how long do you think training on a GPU would take, say, for a country with 100,000 addresses?

That would depend on the GPU specs, but probably a couple of hours with the fasttext models and up to a couple of days with the bpemb models. This is speculative, however, as it's been a while since I've trained these models.

Is it enough to copy only this file or are other files needed?

The checkpoint itself is enough to load the model and reuse it; no further training is needed on machine B.
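
For example, something along these lines should work on machine B (the checkpoint path below is a placeholder; use the same model_type you retrained on machine A):

from deepparse.parser import AddressParser

# Point the parser at the copied checkpoint instead of the default pretrained weights.
address_parser = AddressParser(model_type="fasttext",
                               path_to_retrained_model="checkpoints/retrained_fasttext_address_parser.ckpt")

print(address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6"))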

I hope this answers your questions. I'm also preparing a little guide for the retraining process using Deepparse. It'll be available soon.

Hello @MAYAS3,

An extremely big thank you for the quick and detailed explanations and answers. They have been extremely helpful. I will take a closer look at the documentation and also at your paper.

As you have answered all of my (many :-) ) questions, I will close this ticket.