ICLRandD / Blackstone

A spaCy pipeline and model for NLP on unstructured legal text.

Home Page: https://research.iclr.co.uk

Abbreviation detection not working where short form contains a space followed by digits

ICLRandD opened this issue

The current implementation of AbbreviationDetector() does not handle abbreviations whose short form consists of letters followed by a space and then digits.

For example, in this scenario:

The Proceeds of Crime Act 2002 ("PoCA 2002")

The abbreviation is not matched.

The original implementation in scispaCy does not appear to have been built to handle instances in which the short form is bounded by quote marks.
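To make the failing case concrete, here is a minimal standard-library sketch of how a bracketed short form could be located when it is wrapped in quote marks and ends in a space followed by digits. It is not taken from Blackstone or scispaCy; the pattern and the function name `find_short_forms` are hypothetical, and it only finds the short form, not the matching long form.

```python
import re

# Hypothetical pattern, not taken from Blackstone or scispaCy: a parenthesised
# short form, optionally wrapped in straight or curly quote marks, that may end
# in a space followed by digits (for example a statute year).
SHORT_FORM = re.compile(
    r"""\(\s*["'“”‘’]?              # opening bracket and optional quote mark
        (?P<short>[A-Za-z][\w.&-]*  # the abbreviation itself, e.g. PoCA
        (?:\s\d+)?)                 # optional single space plus digits
        ["'“”‘’]?\s*\)              # optional quote mark and closing bracket
    """,
    re.VERBOSE,
)


def find_short_forms(text):
    """Return candidate short forms with brackets and quote marks stripped."""
    return [m.group("short") for m in SHORT_FORM.finditer(text)]


print(find_short_forms('The Proceeds of Crime Act 2002 ("PoCA 2002")'))
# ['PoCA 2002']
```

A detector would still need to align each candidate with the preceding long form; this sketch only covers the step that the quote marks and the trailing digits appear to break.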

You might be interested in an alternative Python implementation of Schwartz-Hearst which handles this scenario.

https://github.com/philgooch/abbreviation-extraction

E.g.

pip install abbreviations

In [1]: from abbreviations import schwartz_hearst

In [2]: schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='The Proceeds of Crime Act 2002 ("PoCA 2002")')
Out[2]: {'PoCA 2002': 'Proceeds of Crime Act 2002'}