GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/

Looking for improvement ideas for Deepparse

davebulaval opened this issue · comments

We are looking for ideas to improve Deepparse. So, if you have any, you can provide them by replying to this issue.

Hey @davebulaval - great library!

One thing I can think of is a "searcher" (e.g., imagine I have a long text and I want to detect all addresses in that text).

Hi @duarteocarmo, thanks!

It is clearly a nice use case that we have already thought about and put effort into (with an internship). The problem is that we cannot get our hands on a nice dataset to train statistical models on. We were able to do some work on English text, but we cannot make a multilingual approach as of today due to the lack of datasets.

Moreover, if you have data in your possession, we could take a look to develop something.

I think any email dataset would do, no? 😉

Email datasets are typically not multilingual; thus, they are not a good fit for training statistical models for a multinational case. Since Deepparse is multilingual, it would be incoherent to propose monolingual solutions. Moreover, email datasets are, by design, short texts (and monolingual) that might not generalize well to longer text.

I was thinking about the signatures - but you are right that it might not generalize.

One other idea is to use something like spaCy NER to detect locations. If a location is detected, you can use those spans to start building a dataset.

But it is only a suggestion. Let me know if you need any more input - happy to help.

I would use a regex instead of statistical models for email signatures, since people usually place addresses "below" their names. Thus, it would be easier to extract them with a regex.
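For illustration, a rough sketch of that regex idea; the sample signature and the pattern are only assumptions for the sketch, not something Deepparse provides:

import re

# Toy signature; real signatures vary a lot.
signature = """John Doe
Acme Corp.
1-123 Rue Toto
Montreal QC H2X 1Y6"""

# Naive heuristic: a line starting with a (possibly hyphenated) civic number
# followed by text often marks the street line placed below the name.
street_line = re.compile(r"^\d+(?:-\d+)?\s+\S.*$", re.MULTILINE)

print(street_line.findall(signature))  # ['1-123 Rue Toto']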

We also thought about using SpaCy NER to pre-annotate a dataset, but the problem is still creating a multinational addresses dataset. That is, yes, I can validate addresses written in French and English but not in other languages.

Maybe we could create a wrapper around SOTA multilingual NER models such as BERT-like models and rely on their NLU performance to address the matter rather than train/retrain a model. That could be an 'easy' way to handle it. Moreover, we could also offer a retrain method to fine-tune a pre-trained model on a specific language. That could work.

I will think about that and get back to you. I have to think about the necessary effort and see if I can get the resources to do it.
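For illustration, a rough sketch of what such a wrapper could look like, assuming a Hugging Face multilingual NER pipeline; the model checkpoint and the span handling are assumptions, not existing Deepparse features:

from transformers import pipeline
from deepparse.parser import AddressParser

# Any multilingual NER model exposing a location label would do;
# this particular checkpoint is only an assumption for the sketch.
ner = pipeline("ner", model="Davlan/bert-base-multilingual-cased-ner-hrl", aggregation_strategy="simple")
address_parser = AddressParser(model_type="bpemb")

def parse_addresses_in_text(text):
    # Detect location-like spans with the NER model, then parse each span with Deepparse.
    # In practice, the raw spans would likely need to be merged/expanded to cover full addresses.
    spans = [entity["word"] for entity in ner(text) if entity["entity_group"] == "LOC"]
    return [address_parser(span) for span in spans]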

I think the parsing for apartments in Canada can be improved:

If you take a look at:
https://www.canadapost-postescanada.ca/cpc/en/support/kb/sending/general-information/how-to-address-mail-and-parcels

Put a hyphen between the unit/suite/apartment number and the street number. Don’t use the # symbol.

address_parser("1-123 Rue Toto Montreal Canada")

obtained:

FormattedParsedAddress<StreetNumber='1-123', StreetName='rue toto', Municipality='montreal', Province='canada'>

expected:

FormattedParsedAddress<StreetNumber='123', StreetName='rue toto', Unit='1', Municipality='montreal', Province='canada'>

NB. libpostal gives the same incorrect result:

docker run -d -p 8080:8080 clicksend/libpostal-rest  
curl -X POST -d '{"query": "1-123 rue toto Montreal Quebec Canada"}' localhost:8080/parser | jq "."
[
  {
    "label": "house_number",
    "value": "1-123"
  },
  {
    "label": "road",
    "value": "rue toto"
  },
  {
    "label": "city",
    "value": "montreal"
  },
  {
    "label": "state",
    "value": "quebec"
  },
  {
    "label": "country",
    "value": "canada"
  }
]

@MasseGuillaume, I've pushed a fix in dev to handle this case. However, since no such cases were seen during training, performance is lower than usual (see here). Therefore, I will work on new models, but it will take a few days to create them.

In the meantime, I've created a new small dataset that can be used to fine-tune the models to increase performance on those cases. I recommend adding other "normal" data (not just this case) during the fine-tuning.

V2 (I've changed the approach)

@MasseGuillaume, I've pushed a fix in dev where one can use the flag with_hyphen_split during parsing to handle these cases. However, since no such cases were seen during training, performance is lower than usual (see here). I've created a new small dataset that can be used to fine-tune the models to increase performance on those cases. I recommend adding other "normal" data (not just this case) during the fine-tuning.

However, since some countries do not use hyphens this way, I will not release new models that explicitly handle these cases, as it would lower performance for other countries (this is why I've added a flag instead of handling it automatically for all addresses).
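For illustration, a minimal usage sketch based on the flag described above; whether the flag is passed at parse time exactly like this is an assumption, so check the 0.8.0 documentation for the exact call:

from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="fasttext")

# Opt in to splitting the leading unit from the hyphenated street number;
# it stays off by default so countries that use hyphens differently are unaffected.
parsed = address_parser("1-123 Rue Toto Montreal Canada", with_hyphen_split=True)
print(parsed)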

@MasseGuillaume it is released in 0.8.0.

Just spitballing here, but is there any value in making it possible to restrict the scope of the parser? For example, we know all the addresses we get are in the UK, so we're wondering if adding that as a constraint could either improve the accuracy or speed/memory usage.

Hi @AndrewIngram, it would improve accuracy but not speed or memory usage. However, if we reduced the model's scope to a single country, we could also reduce its size, thus decreasing memory usage and improving speed.

Unfortunately, I don't see an easy way to implement this that will not require too much time for me as a single dev.

However, there is an option for you! Our dataset is public, and we have UK addresses. So you could retrain a smaller model only on UK addresses, increasing performance and speed and lowering memory usage. I will add (probably next week) a complete example in the doc for the UK, along with metrics monitoring and all, for you to take inspiration from.

Hi @AndrewIngram, I've experimented on this matter (i.e. reducing the seq2seq hidden size for both models), and unfortunately, it did not improve memory usage or processing speed for either model.

However, you can improve performance for specific countries using our public dataset. So you could download the dataset, use the gb.p files to retrain our parser using the following procedure, and use it for your data. I've pushed a subset of the codebase I've used to experiment on memory size and processing speed here. I will neither merge nor delete this branch, since I don't find this codebase relevant for the documentation. But, for your specific case, you could start from there to extract the UK addresses.
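For reference, a rough sketch of that retraining procedure on the UK subset; the file path and hyperparameters are placeholders, and the exact retrain signature should be checked against the documentation:

from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# gb.p is the UK file from the public dataset mentioned above; the local path is an assumption.
training_container = PickleDatasetContainer("./data/gb.p")

address_parser = AddressParser(model_type="fasttext")

# Placeholder hyperparameters for a country-specific fine-tuning run.
address_parser.retrain(training_container, train_ratio=0.8, epochs=5, batch_size=32, logging_path="./uk_checkpoints")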

If you need more help, we could set a meeting and discuss other approaches that you could develop outside of Deepparse.

Edit 19/08/2022: added the code example in the doc.

I have a dataset of addresses; can you guide me on how I can add tags to all the addresses? Thanks.

Hi, @ZubairShahzad,

Since it is not an improvement idea, I will respond to the other issue (#139) you have opened.

Moved the discussion into the Discussions feature from GitHub. See here.