GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/

Retraining With Our Own Data

zubairshahzadarain opened this issue · comments

Dear team,

I have an address dataset for countries that are not on your list. I want to train a model, but I am facing an issue with how to prepare the data for training. I have seen the sample file you use to tag every word in an address, but my dataset contains 4.5 million addresses; is there any way to tag the whole dataset?

Please guide me.

Thanks, and sorry about my English.

Thank you for your interest in improving Deepparse.

Can you help with how I can prepare a dataset from raw addresses? How can I tag 4.5 million addresses? I only have the addresses themselves.

Hi @ZubairShahzad,

We offer code examples (both sketched below) to

  1. retrain a parsing model, which guides you through using annotated (i.e., already parsed) addresses to improve performance, and
  2. retrain with new prediction tags, which guides you through changing the tag set and retraining the last prediction layer to predict the tags that suit your needs.
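For illustration, here is a minimal sketch in the spirit of those examples. The file path, tag names, and hyperparameters are placeholders, and the exact `retrain` keyword arguments may vary slightly across deepparse versions, so check the docs for the release you use.

```python
from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# A pickled list of (address, list_of_tags) tuples, one tag per
# whitespace-separated word -- the format used by the retrain examples.
container = PickleDatasetContainer("./my_country_dataset.p")  # placeholder path

address_parser = AddressParser(model_type="bpemb", device=0)

# Case 1: keep the default tag set and simply fine-tune on your data.
address_parser.retrain(container, train_ratio=0.8, epochs=5, batch_size=32)

# Case 2: retrain the last prediction layer with your own tag set.
# The dictionary maps each tag to an index and must include an EOS tag;
# the container's annotations must then use this same tag set.
my_tags = {"StreetNumber": 0, "StreetName": 1, "City": 2, "PostalCode": 3, "EOS": 4}
address_parser.retrain(container, train_ratio=0.8, epochs=5, prediction_tags=my_tags)
```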

Regarding a strategy for building a dataset for your country, I would recommend a bootstrapping approach. Namely:
start by parsing something like a thousand addresses, manually fix the errors, retrain the model with those new examples, then parse a new batch of addresses, validate their annotations, retrain again, and so on until performance is sufficient. Each iteration should improve performance and reduce the time needed to validate the parsed addresses. Depending on the addresses' country, other tricks (e.g., domain transfer) could also speed up the annotation. A sketch of one iteration follows.
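A rough sketch of one bootstrap iteration, assuming your raw addresses sit in a plain-text file with one address per line; the file names are placeholders, the manual correction happens outside the code, and the `address_parsed_components` attribute follows the deepparse docs at the time of writing.

```python
import pickle

from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="fasttext", device=0)

# Placeholder file: one raw address per line.
with open("unlabeled_addresses.txt", encoding="utf-8") as f:
    raw_addresses = [line.strip() for line in f if line.strip()]

# 1. Parse a seed batch of ~1,000 addresses with the pretrained model.
parsed_addresses = address_parser(raw_addresses[:1000])

# 2. Dump (address, tags) pairs for manual review; fixing only the wrong
#    tags is much faster than annotating from scratch.
to_review = [
    (parsed.raw_address, [tag for _, tag in parsed.address_parsed_components])
    for parsed in parsed_addresses
]
with open("to_review.p", "wb") as f:
    pickle.dump(to_review, f)

# 3. After manual correction (saved here as reviewed.p), retrain and
#    repeat on the next batch until the error rate is acceptable.
container = PickleDatasetContainer("reviewed.p")
address_parser.retrain(container, train_ratio=0.8, epochs=5, batch_size=32)
```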

After that, if you are willing to share the dataset, we would be more than happy to include it in our public one, available here.

Hello @ZubairShahzad,

I think the strategy proposed by @davebulaval is your best bet. One thing I would add: clean your data a little by removing unnecessary punctuation and lowercasing your addresses to better match the formatting of the original training data. This should limit model errors caused by address formatting rather than content. A small cleaning sketch follows.
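A minimal example of that kind of cleaning; exactly which punctuation to strip is a judgment call for your data, so treat the character set below as a starting point.

```python
import re

def clean_address(address: str) -> str:
    # Lowercase and strip punctuation that rarely carries parsing signal,
    # so inputs better match the formatting of the training data.
    address = address.lower()
    address = re.sub(r"[,;:.!?()#]", " ", address)
    # Collapse the leftover runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", address).strip()

print(clean_address("123, Main St., Springfield!"))  # -> "123 main st springfield"
```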

As pointed out by @davebulaval, if you are willing to share your data, we could also recommend a preprocessing strategy.

This issue is stale because it has been open 60 days with no activity.
Stale issues will automatically be closed 30 days after being marked Stale.