GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/


Generating training data for USPS address parsing

lokkju opened this issue · comments

Note: I may be able to make whatever training dataset I create available under an open license; it's a goal for sure.

I'm trying to create a good dataset to use for training for parsing USPS (United States) addresses using proper USPS address parts. I have both an enormous dataset (millions) of clean addresses correctly labeled, as well as some (hundreds of thousands) of dirty addresses that have been matched to clean addresses, and so labeled.

Those clean addresses all use official abbreviations for Street Type, Street Pre Direction, Unit Type, etc.; they also have very few missing address parts. I imagine I'd want to synthetically generate non-abbreviated representations of each possible abbreviated term; my questions are:

  1. Given that I have labeled addresses for 90% of the entire US, how do I determine the point of diminishing returns for the size of the training set?
  2. What portion of the training dataset should consist of the alternative non-abbreviated terms?

My initial idea is to randomly select 20 addresses from every 5-digit zip code, resulting in ~1M addresses, then synthetically modify about 50% of them to use alternative, non-official forms of the abbreviated terms.
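The sampling-plus-expansion plan above can be sketched in plain Python. The expansion table here is hypothetical (the official abbreviations come from USPS Publication 28), and `expand_abbreviations` / `sample_per_zipcode` are illustrative names, not deepparse APIs:

```python
import random
from collections import defaultdict

# Hypothetical expansion table; the real official abbreviations come from
# USPS Publication 28 (street suffixes, directionals, unit designators).
USPS_EXPANSIONS = {
    "ST": "STREET",
    "AVE": "AVENUE",
    "BLVD": "BOULEVARD",
    "N": "NORTH",
    "APT": "APARTMENT",
}

def expand_abbreviations(tokens, expansion_prob=0.5, rng=random):
    """Replace each official abbreviation with its long form, with some probability."""
    return [
        USPS_EXPANSIONS[t] if t in USPS_EXPANSIONS and rng.random() < expansion_prob
        else t
        for t in tokens
    ]

def sample_per_zipcode(addresses, per_zip=20, rng=random):
    """addresses: iterable of (zipcode, token_list); keep up to per_zip per zip code."""
    by_zip = defaultdict(list)
    for zipcode, tokens in addresses:
        by_zip[zipcode].append(tokens)
    sample = []
    for rows in by_zip.values():
        rng.shuffle(rows)
        sample.extend(rows[:per_zip])
    return sample
```

Applying `expand_abbreviations` to roughly half the sampled addresses would give the 50/50 abbreviated/expanded mix described above.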

Marouane and I are rushing to finish an extended version of the original article behind Deepparse, and we are at the end of our semester, so our response may be delayed. But here are some quick insights (and feel free to keep us updated on your project, since we are more than interested in adding your dataset to our open-source one).

  1. The best way to find the "minimal" size of your dataset is to compute some data points. By that I mean retrain the model with N examples, 2N examples, 3N examples, and so on, then plot the accuracies to see the global pattern. I recommend a low point, some mid-range points, and a high end. On our side, we trained using only 100,000 examples for the US to achieve our accuracy, so if you use a million, I would definitely expect better results.
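The N / 2N / 3N experiment above is just a learning curve. A minimal sketch of the loop follows; `train_and_evaluate` is a stand-in you would replace with an actual deepparse retrain-and-test run, stubbed here so the structure is runnable:

```python
def train_and_evaluate(train_subset):
    """Placeholder: retrain the parser on train_subset and return test accuracy."""
    return 0.0  # stub value; a real run would return the measured accuracy

def learning_curve(dataset, base_n=100_000, steps=(1, 2, 3, 5, 10)):
    """Train on N, 2N, 3N, ... examples and collect (size, accuracy) points."""
    points = []
    for multiplier in steps:
        size = min(base_n * multiplier, len(dataset))
        points.append((size, train_and_evaluate(dataset[:size])))
        if size == len(dataset):  # no more data to add
            break
    return points
```

Plotting the collected `(size, accuracy)` points shows where accuracy flattens out, i.e. the point of diminishing returns.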

I would first try the `test()` method to see what accuracy you get on your non-official abbreviated terms. Be sure to update to the newest release, since we found two major bugs in the prediction.
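For a quick sanity check alongside a full `test()` run, the token-level tagging accuracy that such an evaluation reports can be computed by hand. This is a generic sketch, not deepparse's implementation, and the tag names are illustrative:

```python
def token_accuracy(predictions, gold):
    """Fraction of address tokens whose predicted tag matches the gold tag.

    predictions, gold: lists of tag sequences, one sequence per address.
    """
    correct = total = 0
    for pred_tags, gold_tags in zip(predictions, gold):
        for p, g in zip(pred_tags, gold_tags):
            correct += (p == g)
            total += 1
    return correct / total if total else 0.0
```

Comparing this number on the official-abbreviation subset versus the expanded subset would show how much the non-official forms actually hurt.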

It would be appreciated if you could share the results on the non-official abbreviated addresses here.

Also, if you have more tags, we are working on a feature where a user can retrain a model with a different tag space (see PR). It is in beta; maybe you could give it a try and let us know how it goes. Use `pip install -U git+https://github.com/GRAAL-Research/deepparse.git@modify_tags_retrain` to install the beta feature, but keep in mind that it will override the "normal" release, so reinstall the normal one afterwards with `pip install -U deepparse`.
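For a USPS-style tag space, the custom tag dictionary for that feature would look roughly like the sketch below. This is an assumption based on the feature's description: a mapping from tag names to contiguous integer indices, with an `EOS` (end-of-sequence) entry for the seq2seq decoder; verify the exact format against the PR before relying on it. The tag names themselves are hypothetical:

```python
# Hypothetical USPS-style tag space for the beta retrain-with-new-tags feature.
# Assumed format: tag name -> contiguous integer index, plus a required EOS tag
# (check the PR's documentation for the authoritative shape).
usps_tags = {
    "PrimaryNumber": 0,
    "PreDirection": 1,
    "StreetName": 2,
    "StreetType": 3,
    "UnitType": 4,
    "UnitNumber": 5,
    "ZipCode": 6,
    "EOS": 7,
}

# The retrain call would then look roughly like (not run here; signature
# assumed from the feature branch):
# address_parser.retrain(training_container, train_ratio=0.8, epochs=5,
#                        prediction_tags=usps_tags)
```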

  2. If we take the example of the incomplete data: we retrained the model with 100K complete examples per country (20 countries) plus 5K incomplete examples to achieve SOTA accuracy. Following this scheme should be a good start (following the idea of 5 per zip code, maybe 1 or 2 per zip code as non-official abbreviated data).
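The 100K-complete-to-5K-incomplete ratio above amounts to mixing roughly 5% "hard" examples into the training set. A minimal sketch of that mixing step, with illustrative names (not deepparse APIs):

```python
import random

def build_mixed_training_set(clean, noisy, noisy_per_clean=0.05, rng=random):
    """Mix clean addresses with a small share of noisy/expanded ones,
    roughly mirroring the 100K-complete / 5K-incomplete ratio."""
    n_noisy = min(len(noisy), int(len(clean) * noisy_per_clean))
    mixed = list(clean) + rng.sample(list(noisy), n_noisy)
    rng.shuffle(mixed)  # avoid the model seeing all noisy examples last
    return mixed
```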

Thanks for the reply and input; I'll keep you updated for sure!

Any update @lokkju ?