Proper splitting of hyphen-linked address components (unit-address)
davebulaval opened this issue
Originally posted by @MasseGuillaume in #136 (comment)
I think the parsing for apartments in Canada can be improved:
If you take a look at:
https://www.canadapost-postescanada.ca/cpc/en/support/kb/sending/general-information/how-to-address-mail-and-parcels
> Put a hyphen between the unit/suite/apartment number and the street number. Don’t use the # symbol.
address_parser("1-123 Rue Toto Montreal Canada")
obtained:
FormattedParsedAddress<StreetNumber='1-123', StreetName='rue toto', Municipality='montreal', Province='canada'>
expected:
FormattedParsedAddress<StreetNumber='123', StreetName='rue toto', Unit='1', Municipality='montreal', Province='canada'>
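To make the expected behaviour concrete, here is a minimal sketch of the split as a post-processing step on the street-number token. `split_unit_street_number` is a hypothetical helper, not part of deepparse, and it assumes the Canada Post convention of `<unit>-<street number>` with digits on both sides of the hyphen.

```python
import re

# Hypothetical helper for illustration only; not a deepparse or libpostal API.
# Assumption: the token follows "<unit>-<street number>", digits on both sides.
_UNIT_STREET_NUMBER = re.compile(r"(?P<unit>\d+)-(?P<number>\d+)")


def split_unit_street_number(token):
    """Return (unit, street_number); unit is None when no hyphen-linked unit is found."""
    match = _UNIT_STREET_NUMBER.fullmatch(token)
    if match is None:
        # Nothing to split: keep the token as the street number.
        return None, token
    return match.group("unit"), match.group("number")


unit, street_number = split_unit_street_number("1-123")
```

A rule like this would probably misfire on hyphenated street-number ranges (for example, one building covering two civic numbers), so it only illustrates the target output and is not a proposed fix for the model itself.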
NB. libpostal gives the same incorrect result:
docker run -d -p 8080:8080 clicksend/libpostal-rest
curl -X POST -d '{"query": "1-123 rue toto Montreal Quebec Canada"}' localhost:8080/parser | jq "."
[
  {
    "label": "house_number",
    "value": "1-123"
  },
  {
    "label": "road",
    "value": "rue toto"
  },
  {
    "label": "city",
    "value": "montreal"
  },
  {
    "label": "state",
    "value": "quebec"
  },
  {
    "label": "country",
    "value": "canada"
  }
]
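For comparison, the same idea could be applied to the libpostal response above. This is only a sketch: `split_house_number` is a hypothetical function, and it assumes the response has already been decoded into a list of `label`/`value` dictionaries and that a hyphenated `house_number` always means `<unit>-<street number>`.

```python
import re


def split_house_number(components):
    """Split a '<unit>-<number>' house_number entry into unit and house_number entries.

    Hypothetical post-processing step; assumes `components` is the decoded
    list of {"label": ..., "value": ...} dictionaries returned by the parser.
    """
    result = []
    for component in components:
        match = None
        if component["label"] == "house_number":
            match = re.fullmatch(r"(\d+)-(\d+)", component["value"])
        if match:
            # Emit the unit first, then the street number, mirroring the input order.
            result.append({"label": "unit", "value": match.group(1)})
            result.append({"label": "house_number", "value": match.group(2)})
        else:
            result.append(component)
    return result


parsed = [
    {"label": "house_number", "value": "1-123"},
    {"label": "road", "value": "rue toto"},
]
fixed = split_house_number(parsed)
```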
Out-of-the-box performance of each model, evaluated on a new dataset built for these cases, is as follows.
| Model Type | Accuracy (%) |
|---|---|
| FastText | 86.50 |
| FastTextAtt | 87.72 |
| BPEmb | 71.85 |
| BPEmbAtt | 87.81 |