GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/

[performance] comma

davebulaval opened this issue

Addresses that contain commas seem to lower parsing performance.

from deepparse.parser import AddressParser

dp = AddressParser(model="bpemb", device=0)

dp("2020 boul. René-Lévesques, Montréal, QC, Canada", with_prob=True).address_parsed_components
#> [('2020', ('PostalCode', 0.8566)),
#>  ('boul.', ('Province', 0.7204)),
#>  ('René-Lévesques,', ('StreetName', 0.7636)),
#>  ('Montréal,', ('StreetName', 0.9614)),
#>  ('QC,', ('StreetName', 0.7382)),
#>  ('Canada', ('Province', 0.5126))]

parsed_address = dp("2020 boul. René-Lévesques Montréal QC", with_prob=True)

print(parsed_address.address_parsed_components)
#> [('2020', ('PostalCode', 0.9467)),
#>  ('boul.', ('StreetName', 0.9895)),
#>  ('René-Lévesques', ('StreetName', 0.9602)),
#>  ('Montréal', ('Municipality', 0.9965)),
#>  ('QC', ('Province', 0.9999))]

In our training dataset, less than 0.006% of the addresses contain at least one comma.

Fixed by removing commas (`,`) from the addresses. Robustness to commas will be improved in future models.
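For anyone hitting this on a version without the fix, here is a minimal sketch of the comma-removal workaround. `strip_commas` is a hypothetical helper written for illustration, not part of the deepparse API, and it is only an assumption that the actual fix behaves this way.

```python
def strip_commas(address: str) -> str:
    """Remove commas from an address and collapse the leftover whitespace.

    Hypothetical helper for illustration only; not deepparse's own code.
    """
    # Replace each comma with a space so "Montréal,QC" does not fuse into one token,
    # then split on whitespace and re-join to collapse repeated spaces.
    return " ".join(address.replace(",", " ").split())


cleaned = strip_commas("2020 boul. René-Lévesques, Montréal, QC, Canada")
print(cleaned)
#> 2020 boul. René-Lévesques Montréal QC Canada
```

The cleaned string can then be passed to the parser as in the examples above, e.g. `dp(strip_commas(address), with_prob=True)`.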