Sentencizer splits codes into several sentences even though they are a single token

Question


etienneguevel opened this issue 2 years ago · comments

Description

Currently, the sentencizer starts a new sentence whenever a "." character is followed by a capital letter.
This is problematic for some codes and acronyms, which can follow this pattern (for example "V.I.H") and therefore get split across several sentences.

ADICAP codes, which the eds.adicap pipeline extracts, can appear in text in the form "code ADICAP : B.H.HP.A7A0". In that case the eds.contextual-matcher used under the hood does not capture the code.

A solution would be to start a new sentence only when a "." is followed by a space, a newline or another separator, and then a capital letter.
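To illustrate the rule, here is a minimal standalone sketch using only Python's re module. It is not the EDS-NLP sentencizer; the BOUNDARY pattern and the split_sentences helper are hypothetical names introduced for this example, and "other separation" is approximated by any whitespace.

```python
import re

# Hypothetical boundary rule (an assumption, not the EDS-NLP implementation):
# a period, then at least one whitespace character (space, tab, newline),
# then an uppercase letter (ASCII or accented Latin-1).
BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-ZÀ-ÖØ-Þ])")


def split_sentences(text: str) -> list[str]:
    """Split text on the boundary rule, dropping the separating whitespace."""
    return [part for part in BOUNDARY.split(text) if part]


# The ADICAP code contains no whitespace after its periods, so it stays whole.
print(split_sentences("code ADICAP : B.H.HP.A7A0"))

# Regular sentences are still separated.
print(split_sentences("Le patient va bien. Il sort demain."))
```

With this rule, "V.I.H" and "B.H.HP.A7A0" are never cut, because none of their periods is followed by whitespace. It would still split after an abbreviation such as "Dr." when a capitalized name follows, so it is only a partial answer.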

How to reproduce the bug

import spacy

nlp = spacy.blank("eds")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")

code = "B.H.HP.A7A0"

for sent in nlp(code).sents:
    print(sent.text)

B.
H.
HP.
A7A0

Your Environment

Operating System: Ubuntu 22.04.1 LTS
Python Version Used: 3.10.6
spaCy Version Used: 3.4.1
EDS-NLP Version Used: 0.7.4

Perceval Wajsburt · Answer 1 · Wed Mar 08 2023 01:01:47 GMT+0800 (China Standard Time)

Thanks for this issue! This has been solved in #192 by changing the tokenization rules to distinguish "real" end-of-sentence periods from abbreviation periods.