Abbreviation detection not working where short form contains a space followed by digits
ICLRandD opened this issue
The current implementation of AbbreviationDetector() does not handle abbreviations whose short form contains a space followed by a number.
For example, in this scenario:
The Proceeds of Crime Act 2002 ("PoCA 2002")
The abbreviation is not matched.
The original implementation in scispaCy does not appear to have been built to handle cases in which the short form is enclosed in quote marks.
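If the quote marks are indeed what trips the detector, one possible workaround is to strip them from parenthesised short forms before the text reaches the pipeline. The sketch below is only that: `strip_quoted_short_forms` is a hypothetical helper, not part of scispaCy, and whether this alone is enough for short forms containing a space and digits has not been verified here.

```python
import re

# Straight and curly quote characters that may wrap a short form.
QUOTES = "\"'\u201c\u201d\u2018\u2019"

# Matches a parenthesised span whose content is wrapped in quote marks,
# e.g. ("PoCA 2002"), and captures the content without the quotes.
_QUOTED_PAREN = re.compile(
    rf"\(\s*[{QUOTES}]\s*([^(){QUOTES}]+?)\s*[{QUOTES}]\s*\)"
)


def strip_quoted_short_forms(text: str) -> str:
    """Hypothetical pre-processing step: rewrite ("X") as (X)."""
    return _QUOTED_PAREN.sub(r"(\1)", text)


text = 'The Proceeds of Crime Act 2002 ("PoCA 2002")'
print(strip_quoted_short_forms(text))
# The Proceeds of Crime Act 2002 (PoCA 2002)
```

The cleaned string can then be passed to the pipeline as usual. Note that this changes character offsets relative to the original text, so any spans returned would need mapping back if the original positions matter.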
You might be interested in an alternative Python implementation of Schwartz-Hearst which handles this scenario.
https://github.com/philgooch/abbreviation-extraction
E.g.
pip install abbreviations
In [1]: from abbreviations import schwartz_hearst
In [2]: schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='The Proceeds of Crime Act 2002 ("PoCA 2002")')
Out[2]: {'PoCA 2002': 'Proceeds of Crime Act 2002'}