GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/


Generating training data for USPS address parsing

lokkju opened this issue · comments

Note: I may be able to make whatever training dataset I create available under an open license; it's a goal for sure.

I'm trying to create a good dataset to use for training for parsing USPS (United States) addresses using proper USPS address parts. I have both an enormous dataset (millions) of clean addresses correctly labeled, as well as some (hundreds of thousands) of dirty addresses that have been matched to clean addresses, and so labeled.

Those clean addresses all use official abbreviations for Street Type, Street Pre Direction, Unit Type, etc.; they also have very few missing address parts. I imagine I'd want to synthetically generate non-abbreviated representations of each possible abbreviated term; my questions are:

  1. Given that I have labeled addresses for 90% of the entire US, how do I determine the point of diminishing returns for the size of the training set?
  2. What portion of the training dataset should consist of the alternative non-abbreviated terms?

My initial idea is to randomly select 20 addresses from every 5-digit zip code, resulting in ~1M addresses, then synthetically modify about 50% of them to use alternative, non-official forms of the abbreviated terms.
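The sampling-plus-expansion plan above can be sketched in plain Python. The expansion table here is hypothetical (the official abbreviations come from USPS Publication 28), and `expand_abbreviations` / `sample_per_zipcode` are illustrative names, not deepparse APIs:

```python
import random
from collections import defaultdict

# Hypothetical expansion table; the real official abbreviations come from
# USPS Publication 28 (street suffixes, directionals, unit designators).
USPS_EXPANSIONS = {
    "ST": "STREET",
    "AVE": "AVENUE",
    "BLVD": "BOULEVARD",
    "N": "NORTH",
    "APT": "APARTMENT",
}

def expand_abbreviations(tokens, expansion_prob=0.5, rng=random):
    """Replace each official abbreviation with its long form, with some probability."""
    return [
        USPS_EXPANSIONS[t] if t in USPS_EXPANSIONS and rng.random() < expansion_prob
        else t
        for t in tokens
    ]

def sample_per_zipcode(addresses, per_zip=20, rng=random):
    """addresses: iterable of (zipcode, token_list); keep up to per_zip per zip code."""
    by_zip = defaultdict(list)
    for zipcode, tokens in addresses:
        by_zip[zipcode].append(tokens)
    sample = []
    for rows in by_zip.values():
        rng.shuffle(rows)
        sample.extend(rows[:per_zip])
    return sample
```

Applying `expand_abbreviations` to roughly half the sampled addresses would give the 50/50 abbreviated/expanded mix described above.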

Marouane and I are rushing to finish an extended version of the original article behind Deepparse, and we are at the end of our semester, so our response may be delayed. But here are some quick insights (and feel free to keep us updated on your project, since we are more than interested in adding your dataset to our open-source one).

  1. The best way to find the "minimal" size of your dataset is to compute some data points. By that I mean retrain the model with N examples, 2N examples, 3N examples, and so on, then plot the accuracies to see the global pattern. I recommend a low point, some mid-range points, and a high end. On our side, we trained using only 100,000 examples for the US to achieve our accuracy, so if you use a million, I would definitely expect better results.
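The N / 2N / 3N experiment above is just a learning curve. A minimal sketch of the loop follows; `train_and_evaluate` is a stand-in you would replace with an actual deepparse retrain-and-test run, stubbed here so the structure is runnable:

```python
def train_and_evaluate(train_subset):
    """Placeholder: retrain the parser on train_subset and return test accuracy."""
    return 0.0  # stub value; a real run would return the measured accuracy

def learning_curve(dataset, base_n=100_000, steps=(1, 2, 3, 5, 10)):
    """Train on N, 2N, 3N, ... examples and collect (size, accuracy) points."""
    points = []
    for multiplier in steps:
        size = min(base_n * multiplier, len(dataset))
        points.append((size, train_and_evaluate(dataset[:size])))
        if size == len(dataset):  # no more data to add
            break
    return points
```

Plotting the collected `(size, accuracy)` points shows where accuracy flattens out, i.e. the point of diminishing returns.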

I would first try the `test()` method to see what accuracy you get on your non-official abbreviated terms. Be sure to update to the newest release, since we found two major bugs in the prediction.
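For a quick sanity check alongside a full `test()` run, the token-level tagging accuracy that such an evaluation reports can be computed by hand. This is a generic sketch, not deepparse's implementation, and the tag names are illustrative:

```python
def token_accuracy(predictions, gold):
    """Fraction of address tokens whose predicted tag matches the gold tag.

    predictions, gold: lists of tag sequences, one sequence per address.
    """
    correct = total = 0
    for pred_tags, gold_tags in zip(predictions, gold):
        for p, g in zip(pred_tags, gold_tags):
            correct += (p == g)
            total += 1
    return correct / total if total else 0.0
```

Comparing this number on the official-abbreviation subset versus the expanded subset would show how much the non-official forms actually hurt.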

It would be appreciated if you could share the results on the non-official abbreviated addresses here.

Also, if you have more tags, we are working on a feature where a user can retrain a model with a different tag space (see PR). It is in beta; maybe you could give it a try and let us know how it goes. Use `pip install -U git+https://github.com/GRAAL-Research/deepparse.git@modify_tags_retrain` to install the beta feature, but keep in mind that it will override the "normal" release, so reinstall the normal one afterwards with `pip install -U deepparse`.
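For a USPS-style tag space, the custom tag dictionary for that feature would look roughly like the sketch below. This is an assumption based on the feature's description: a mapping from tag names to contiguous integer indices, with an `EOS` (end-of-sequence) entry for the seq2seq decoder; verify the exact format against the PR before relying on it. The tag names themselves are hypothetical:

```python
# Hypothetical USPS-style tag space for the beta retrain-with-new-tags feature.
# Assumed format: tag name -> contiguous integer index, plus a required EOS tag
# (check the PR's documentation for the authoritative shape).
usps_tags = {
    "PrimaryNumber": 0,
    "PreDirection": 1,
    "StreetName": 2,
    "StreetType": 3,
    "UnitType": 4,
    "UnitNumber": 5,
    "ZipCode": 6,
    "EOS": 7,
}

# The retrain call would then look roughly like (not run here; signature
# assumed from the feature branch):
# address_parser.retrain(training_container, train_ratio=0.8, epochs=5,
#                        prediction_tags=usps_tags)
```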

  2. If we take the example of the incomplete data: we retrained the model with 100K complete examples per country (20 countries) plus 5K incomplete examples to achieve SOTA accuracy. Following this scheme should be a good start (following the idea of 5 per zip code, maybe 1 or 2 per zip code as non-official abbreviated data).
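The 100K-complete-to-5K-incomplete ratio above amounts to mixing roughly 5% "hard" examples into the training set. A minimal sketch of that mixing step, with illustrative names (not deepparse APIs):

```python
import random

def build_mixed_training_set(clean, noisy, noisy_per_clean=0.05, rng=random):
    """Mix clean addresses with a small share of noisy/expanded ones,
    roughly mirroring the 100K-complete / 5K-incomplete ratio."""
    n_noisy = min(len(noisy), int(len(clean) * noisy_per_clean))
    mixed = list(clean) + rng.sample(list(noisy), n_noisy)
    rng.shuffle(mixed)  # avoid the model seeing all noisy examples last
    return mixed
```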

Thanks for the reply and input; I'll keep you updated for sure!

Any update @lokkju ?