GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/

Looking for improvement ideas for Deepparse

davebulaval opened this issue · comments

We are looking for ideas to improve Deepparse. So, if you have any, you can provide them by replying to this issue.

Hey @davebulaval - great library!

One thing I can think of is a "searcher" (e.g., imagine I have a long text and I want to detect all addresses in that text).

Hi @duarteocarmo, thanks!

It is clearly a nice use case that we have already thought about and put effort into (with an internship). The problem is that we cannot get our hands on a nice dataset to train statistical models on. We were able to do some work on English text, but we cannot make a multilingual approach as of today due to the lack of datasets.

Moreover, if you have data in your possession, we could take a look to develop something.

I think any email dataset would do, no? 😉

Email datasets are typically not multilingual; thus, they are not a good fit for training statistical models for a multinational case. Since Deepparse is multilingual, it would be incoherent to propose monolingual solutions. Moreover, email datasets are, by design, short texts (and monolingual) that might not generalize well to longer text.

I was thinking about the signatures - but you are right that it might not generalize.

One other idea is to use something like spaCy NER to detect locations. If a location is detected, you can use those spans to start building a dataset.

But it is only a suggestion. Let me know if you need any more input - happy to help.

I would use a regex instead of statistical models for email signatures, since people usually place addresses "below" their names. Thus, it would be easier to extract them with a regex.
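For illustration, a rough sketch of that regex idea; the sample signature and the pattern are only assumptions for the sketch, not something Deepparse provides:

import re

# Toy signature; real signatures vary a lot.
signature = """John Doe
Acme Corp.
1-123 Rue Toto
Montreal QC H2X 1Y6"""

# Naive heuristic: a line starting with a (possibly hyphenated) civic number
# followed by text often marks the street line placed below the name.
street_line = re.compile(r"^\d+(?:-\d+)?\s+\S.*$", re.MULTILINE)

print(street_line.findall(signature))  # ['1-123 Rue Toto']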

We also thought about using SpaCy NER to pre-annotate a dataset, but the problem is still creating a multinational addresses dataset. That is, yes, I can validate addresses written in French and English but not in other languages.

Maybe we could create a wrapper around SOTA multilingual NER models such as BERT-like models and rely on their NLU performance to address the matter rather than train/retrain a model. That could be an 'easy' way to handle it. Moreover, we could also offer a retrain method to fine-tune a pre-trained model on a specific language. That could work.

I will think about that and get back to you. I have to think about the necessary effort and see if I can get the resources to do it.
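For illustration, a rough sketch of what such a wrapper could look like, assuming a Hugging Face multilingual NER pipeline; the model checkpoint and the span handling are assumptions, not existing Deepparse features:

from transformers import pipeline
from deepparse.parser import AddressParser

# Any multilingual NER model exposing a location label would do;
# this particular checkpoint is only an assumption for the sketch.
ner = pipeline("ner", model="Davlan/bert-base-multilingual-cased-ner-hrl", aggregation_strategy="simple")
address_parser = AddressParser(model_type="bpemb")

def parse_addresses_in_text(text):
    # Detect location-like spans with the NER model, then parse each span with Deepparse.
    # In practice, the raw spans would likely need to be merged/expanded to cover full addresses.
    spans = [entity["word"] for entity in ner(text) if entity["entity_group"] == "LOC"]
    return [address_parser(span) for span in spans]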

I think the parsing for apartments in Canada can be improved:

If you take a look at:
https://www.canadapost-postescanada.ca/cpc/en/support/kb/sending/general-information/how-to-address-mail-and-parcels

Put a hyphen between the unit/suite/apartment number and the street number. Don’t use the # symbol.

address_parser("1-123 Rue Toto Montreal Canada")

obtained:

FormattedParsedAddress<StreetNumber='1-123', StreetName='rue toto', Municipality='montreal', Province='canada'>

expected:

FormattedParsedAddress<StreetNumber='123', StreetName='rue toto', Unit='1', Municipality='montreal', Province='canada'>

NB. libpostal gives the same incorrect result:

docker run -d -p 8080:8080 clicksend/libpostal-rest  
curl -X POST -d '{"query": "1-123 rue toto Montreal Quebec Canada"}' localhost:8080/parser | jq "."
[
  {
    "label": "house_number",
    "value": "1-123"
  },
  {
    "label": "road",
    "value": "rue toto"
  },
  {
    "label": "city",
    "value": "montreal"
  },
  {
    "label": "state",
    "value": "quebec"
  },
  {
    "label": "country",
    "value": "canada"
  }
]

@MasseGuillaume, I've pushed a fix in dev to handle this case. However, since no such cases were seen during training, performance is lower than usual (see here). Therefore, I will work on new models, but it will take a few days to create them.

In the meantime, I've created a new small dataset that can be used to fine-tune the models to increase performance on those cases. I recommend adding other "normal" data (not just this case) during the fine-tuning.

V2 (I've changed the approach)

@MasseGuillaume, I've pushed a fix in dev where one can use the flag with_hyphen_split during parsing to handle these cases. However, since no such cases were seen during training, performance is lower than usual (see here). I've created a new small dataset that can be used to fine-tune the models to increase performance on those cases. I recommend adding other "normal" data (not just this case) during the fine-tuning.

However, since some countries do not use hyphens this way, I will not release new models that explicitly handle these cases, as it would lower performance for other countries (this is why I've added a flag instead of handling it automatically for all addresses).
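For illustration, a minimal usage sketch based on the flag described above; whether the flag is passed at parse time exactly like this is an assumption, so check the 0.8.0 documentation for the exact call:

from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="fasttext")

# Opt in to splitting the leading unit from the hyphenated street number;
# it stays off by default so countries that use hyphens differently are unaffected.
parsed = address_parser("1-123 Rue Toto Montreal Canada", with_hyphen_split=True)
print(parsed)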

@MasseGuillaume it is released in 0.8.0.

Just spitballing here, but is there any value in making it possible to restrict the scope of the parser? For example, we know all the addresses we get are in the UK, so we're wondering if adding that as a constraint could either improve the accuracy or speed/memory usage.

Hi @AndrewIngram, it would improve accuracy but not speed or memory usage. However, if we reduced the model's scope to a single country, we could also reduce its size, thus decreasing memory usage and improving speed.

Unfortunately, I don't see an easy way to implement this that will not require too much time for me as a single dev.

However, there is an option for you! Our dataset is public, and we have UK addresses. So you could retrain a smaller model only on UK addresses, increasing performance and speed and lowering memory usage. I will add (probably next week) a complete example in the doc for the UK, along with metrics monitoring and all, for you to take inspiration from.

Hi @AndrewIngram, I've experimented on this matter (i.e. reducing the seq2seq hidden size for both models), and unfortunately, it did not improve memory usage or processing speed for either model.

However, you can improve performance for specific countries using our public dataset. So you could download the dataset, use the gb.p files to retrain our parser using the following procedure, and use it for your data. I've pushed a subset of the codebase I've used to experiment on memory size and processing speed here. I will neither merge nor delete this branch, since I don't find this codebase relevant for the documentation. But, for your specific case, you could start from there to extract the UK addresses.
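For reference, a rough sketch of that retraining procedure on the UK subset; the file path and hyperparameters are placeholders, and the exact retrain signature should be checked against the documentation:

from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# gb.p is the UK file from the public dataset mentioned above; the local path is an assumption.
training_container = PickleDatasetContainer("./data/gb.p")

address_parser = AddressParser(model_type="fasttext")

# Placeholder hyperparameters for a country-specific fine-tuning run.
address_parser.retrain(training_container, train_ratio=0.8, epochs=5, batch_size=32, logging_path="./uk_checkpoints")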

If you need more help, we could set a meeting and discuss other approaches that you could develop outside of Deepparse.

Edit 19/08/2022: added the code example in the doc.

I have a dataset of addresses; can you guide me on how I can add tags to all the addresses? Thanks.

Hi, @ZubairShahzad,

Since it is not an improvement idea, I will respond to the other issue (#139) you have opened.

Moved the discussion into the Discussions feature from GitHub. See here.