GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/


[Question] Training noisy data from another country?

tk512 opened this issue · comments

If I have a large dataset with noisy raw addresses and also correctly parsed results for each one, how do I start with training deepparse to get a trained dataset?

The raw+result data I have is currently in CSV format but with a bit of scripting I can easily transform into another format. I just don't completely understand how to train Deepparse for this.

Hello!

Allow me to refer you to our docs (https://deepparse.org/examples/fine_tuning.html), where a complete training example is available.

In short, all you have to do is transform your data into a list of tuples, where the first element of each tuple is the address as a string and the second element is a list of tags, one per word in the address, e.g. ('south fraser way abbotsford b. c. v2t 1v6', ['StreetName', 'StreetName', 'StreetName', 'Municipality', 'Province', 'Province', 'PostalCode', 'PostalCode']). Once your list is created, you can save it using Python's pickle module and instantiate our PickleDatasetContainer with the corresponding path. After that, you can create an AddressParser with the parameters you want (mainly, choose the model type and the training device). The rest should be quite similar to the example in the docs.
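The steps above could be sketched roughly like this. The CSV layout (raw address in the first column, space-separated tags in the second), the file names, and the `csv_to_pickle` helper are assumptions for illustration; the commented-out deepparse calls follow the fine-tuning example in the docs, so check them against your installed version:

```python
import csv
import pickle


def csv_to_pickle(csv_path, pickle_path):
    """Convert a two-column CSV (raw address, space-separated tags)
    into the list-of-tuples format deepparse expects, then pickle it."""
    data = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for raw_address, tags in csv.reader(f):
            tag_list = tags.split()
            # deepparse expects exactly one tag per whitespace-separated word.
            assert len(raw_address.split()) == len(tag_list)
            data.append((raw_address, tag_list))
    with open(pickle_path, "wb") as f:
        pickle.dump(data, f)


# Training then looks roughly like the docs example (API details may
# differ between deepparse versions):
#
#   from deepparse.dataset_container import PickleDatasetContainer
#   from deepparse.parser import AddressParser
#
#   container = PickleDatasetContainer("dataset.p")
#   parser = AddressParser(model_type="fasttext", device=0)
#   parser.retrain(container, train_ratio=0.8, epochs=5, batch_size=32)
```

The pickled file then becomes the path you hand to PickleDatasetContainer.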

Note that we only support the following list of tags currently:

  • “StreetNumber”: for the street number

  • “StreetName”: for the name of the street

  • “Unit”: for the unit (such as apartment)

  • “Municipality”: for the municipality

  • “Province”: for the province or local region

  • “PostalCode”: for the postal code

  • “Orientation”: for the street orientation (e.g. west, east)

  • “GeneralDelivery”: for other delivery information

Does this help clear things up?

Note that we are working on newer models with the tag "country" (#57).

We are also looking to publish our clean and noisy dataset soon, and we are looking for new data. Is it possible for you to share your data under an MIT license (or something similar)?


Hi, just want to make sure: is there a typo in the example (three "StreetName" tags)?

Also, should the data be tuples in a tuple ((data1), (data2)) or tuples in a list [(data1), (data2)]? I tried with a list but got a KeyError from PyTorch.

It's not a typo. We return a tag for each word, and since south fraser way is a street name, we assign the StreetName tag to each of its words and then let you choose how to process the output.

For your second point, it should be a list of tuples: [(address1, tags1), (address2, tags2), …]

Can you share the stack trace of your error so I get a better idea of where the problem lies?

OOOHHH. I'm the one that's wrong here, hehe. Okay got it.

What if I have a sentence where some words don't belong to any tag? For example: "We sell apples at Orchid Park Street."

Here, we, sell, apples, and at don't belong to any tag, but I want to extract "Orchid Park Street" as the StreetName.

Edit: Gotcha. It should be GeneralDelivery instead. Thanks!
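To make that concrete, here is a sketch (my own illustration, not from the deepparse docs) of how such a sentence could be tagged, with each non-address word labelled GeneralDelivery:

```python
# Non-address words get the GeneralDelivery tag; the street name keeps
# one StreetName tag per word, just like the multi-word example above.
example = (
    "we sell apples at orchid park street",
    ["GeneralDelivery", "GeneralDelivery", "GeneralDelivery", "GeneralDelivery",
     "StreetName", "StreetName", "StreetName"],
)

# The tag list must line up word-for-word with the address string.
assert len(example[0].split()) == len(example[1])
```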

Yes, exactly. That said, we haven't seen addresses like that, so if you see a drop in performance, a little retraining may help.

If you have numerous addresses like that, we would be interested in having them, so we can build a large database and make it available to the community.

Unfortunately, it's not possible for me to share the dataset, since I don't own it and it's only for a competition. :)