[performance] comma
davebulaval opened this issue · comments
David Beauchemin commented
Addresses containing commas seem to lower parsing performance.
from deepparse.parser import AddressParser
dp = AddressParser(model="bpemb", device=0)
dp("2020 boul. René-Lévesques, Montréal, QC, Canada", with_prob=True).address_parsed_components
#> [('2020', ('PostalCode', 0.8566)),
#> ('boul.', ('Province', 0.7204)),
#> ('René-Lévesques,', ('StreetName', 0.7636)),
#> ('Montréal,', ('StreetName', 0.9614)),
#> ('QC,', ('StreetName', 0.7382)),
#> ('Canada', ('Province', 0.5126))]
parsed_address = dp("2020 boul. René-Lévesques Montréal QC", with_prob=True)
parsed_address.address_parsed_components
#> [('2020', ('StreetNumber', 0.9467)),
#> ('boul.', ('StreetName', 0.9895)),
#> ('René-Lévesques', ('StreetName', 0.9602)),
#> ('Montréal', ('Municipality', 0.9965)),
#> ('QC', ('Province', 0.9999))]
David Beauchemin commented
In our training dataset, fewer than 0.006% of the addresses contain at least one comma.
David Beauchemin commented
Fixed by removing commas from the address before parsing. We will improve robustness against commas in future models.
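The workaround described above can be sketched as a small pre-processing step that strips commas before the address reaches the parser (the `clean_address` helper name is hypothetical, for illustration only):

```python
def clean_address(raw_address: str) -> str:
    # Remove commas, since the models were trained on nearly
    # comma-free data and commas confuse the sequence tagger.
    return raw_address.replace(",", "")

cleaned = clean_address("2020 boul. René-Lévesques, Montréal, QC, Canada")
print(cleaned)  # -> 2020 boul. René-Lévesques Montréal QC Canada
```

The cleaned string can then be passed to the parser as usual, e.g. `dp(cleaned, with_prob=True)`.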