GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/


[Question] Training noisy data from another country?

tk512 opened this issue · comments

If I have a large dataset with noisy raw addresses and also correctly parsed results for each one, how do I start with training deepparse to get a trained dataset?

The raw+result data I have is currently in CSV format but with a bit of scripting I can easily transform into another format. I just don't completely understand how to train Deepparse for this.

Hello!

Allow me to refer you to our docs (https://deepparse.org/examples/fine_tuning.html), where a complete training example is available.

In short, all you have to do is transform your data into a list of tuples, where the first element of each tuple is the address as a string and the second element is a list of tags, one per word in the address, e.g. ('south fraser way abbotsford b. c. v2t 1v6', ['StreetName', 'StreetName', 'StreetName', 'Municipality', 'Province', 'Province', 'PostalCode', 'PostalCode']). Once your list is created, you can save it using Python's pickle module and instantiate our PickleDatasetContainer with the corresponding path. After that, you can create an AddressParser with the parameters you want (mainly, choose the model type and the training device). The rest should be quite similar to the example in the docs.
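The steps above could be sketched roughly like this. The CSV layout (raw address in the first column, space-separated tags in the second), the file names, and the `csv_to_pickle` helper are assumptions for illustration; the commented-out deepparse calls follow the fine-tuning example in the docs, so check them against your installed version:

```python
import csv
import pickle


def csv_to_pickle(csv_path, pickle_path):
    """Convert a two-column CSV (raw address, space-separated tags)
    into the list-of-tuples format deepparse expects, then pickle it."""
    data = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for raw_address, tags in csv.reader(f):
            tag_list = tags.split()
            # deepparse expects exactly one tag per whitespace-separated word.
            assert len(raw_address.split()) == len(tag_list)
            data.append((raw_address, tag_list))
    with open(pickle_path, "wb") as f:
        pickle.dump(data, f)


# Training then looks roughly like the docs example (API details may
# differ between deepparse versions):
#
#   from deepparse.dataset_container import PickleDatasetContainer
#   from deepparse.parser import AddressParser
#
#   container = PickleDatasetContainer("dataset.p")
#   parser = AddressParser(model_type="fasttext", device=0)
#   parser.retrain(container, train_ratio=0.8, epochs=5, batch_size=32)
```

The pickled file then becomes the path you hand to PickleDatasetContainer.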

Note that we only support the following list of tags currently:

  • “StreetNumber”: for the street number

  • “StreetName”: for the name of the street

  • “Unit”: for the unit (such as apartment)

  • “Municipality”: for the municipality

  • “Province”: for the province or local region

  • “PostalCode”: for the postal code

  • “Orientation”: for the street orientation (e.g. west, east)

  • “GeneralDelivery”: for other delivery information

Does this help clear things up?

Note that we are working on newer models with the tag "country" (#57).

We are also looking to publish our clean and noisy dataset soon, and we are looking for new data. Is it possible for you to share your data under an MIT license (or something similar)?


Hi, just want to make sure: is there a typo in the example (three "StreetName" tags)?

Also, should the data be tuples in a tuple ((data1), (data2)) or tuples in a list [(data1), (data2)]? I tried with a list but got a KeyError from PyTorch.

It's not a typo. We return a tag for each word, and since south fraser way is a street name, we assign the StreetName tag to each of its words and then let you choose how to process the output.

For your second point, it should be a list of tuples: [(address1, tags1), (address2, tags2), …]

Can you share the stack trace of your error so I get a better idea of where the problem lies?

OOOHHH. I'm the one that's wrong here, hehe. Okay got it.

What if I have a sentence where some words don't belong to any tag? For example: "We sell apples at Orchid Park Street."

Here, we, sell, apples, and at don't belong to any tag, but I want to extract "Orchid Park Street" as the StreetName.

Edit: Gotcha. It should be GeneralDelivery instead. Thanks!
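To make that concrete, here is a sketch (my own illustration, not from the deepparse docs) of how such a sentence could be tagged, with each non-address word labelled GeneralDelivery:

```python
# Non-address words get the GeneralDelivery tag; the street name keeps
# one StreetName tag per word, just like the multi-word example above.
example = (
    "we sell apples at orchid park street",
    ["GeneralDelivery", "GeneralDelivery", "GeneralDelivery", "GeneralDelivery",
     "StreetName", "StreetName", "StreetName"],
)

# The tag list must line up word-for-word with the address string.
assert len(example[0].split()) == len(example[1])
```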

Yes, exactly. That said, we haven't seen addresses like that, so if you see a drop in performance, a little retraining may help.

If you have numerous addresses like that, we would be interested in having them, so we can build a large database and make it available to the community.

Unfortunately, it's not possible for me to share the dataset, since I don't own it and it's only for a competition. :)