Detection of successive empty sections
paul-bssr opened this issue · comments
Description
Using the beta pipeline "eds.sections", I encountered a bug in the detection of sections that follow an empty section. If a section is preceded by an empty section, it is not detected, and its content is labeled as belonging to the preceding empty section.
Empty sections are quite common in documents, so this can lead to entities being assigned to the wrong section.
How to reproduce the bug
For instance, in the following example, the sections "Antécédents :" and "Conclusion :" are not distinguished. As a result, all the content of the "Conclusion" section is tagged as "Antécédents".
import spacy
nlp = spacy.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sections")
# Matcher definition
regex = dict(
    # Myolysis
    rhabdomyolyse="rhabdom[yi]ol[yi]se",
    myolyse="m[yi]ol[yi]se",
)
nlp.add_pipe(
    "eds.matcher",
    config=dict(
        regex=regex,
        attr="NORM",
        ignore_excluded=True,
    ),
)
text = """
Antécédents :
Conclusion :
Patient va mieux
Au total:
sortie du patient
"""
# Run the pipeline on the text, then inspect the detected sections
doc = nlp(text)
doc.spans["sections"]
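To make the expected behavior concrete, here is a minimal pure-Python sketch of splitting the same text into sections. It is not how "eds.sections" is implemented: the `TITLE_PATTERN` regex and the `split_sections` helper are hypothetical names introduced only for illustration, and the three titles are hard-coded from the example above. The point is that an empty section should still be detected, with an empty body, so that the following section keeps its own content.

```python
import re

# Hypothetical title pattern for illustration only; EDS-NLP uses its own,
# much larger set of section patterns.
TITLE_PATTERN = re.compile(r"^(Antécédents|Conclusion|Au total)\s*:", re.MULTILINE)


def split_sections(text):
    """Return (title, body) pairs; an empty section yields an empty body."""
    matches = list(TITLE_PATTERN.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # A section's body runs from the end of its title to the next title,
        # or to the end of the text for the last section.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        sections.append((match.group(1), body))
    return sections


text = """
Antécédents :
Conclusion :
Patient va mieux
Au total:
sortie du patient
"""

for title, body in split_sections(text):
    print(f"{title!r}: {body!r}")
```

With this sketch, "Antécédents" gets an empty body and "Patient va mieux" stays attached to "Conclusion", which is the behavior I would expect from the pipeline.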
Your Environment
- Operating System:
- Python Version Used: 3.10.0
- spaCy Version Used: 3.4.1
- EDS-NLP Version Used: 0.6.1
- Environment Information:
Thank you for making us aware of this! I'll look into it.
PR #114 should solve this, thanks again for the issue!
While waiting for the next release, you can install from the master branch directly with
pip install git+https://github.com/aphp/edsnlp@master
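After installing, a quick way to check which version is active in the current environment is the standard library's `importlib.metadata`. This is only a small sketch: the `installed_version` helper is a hypothetical name, and it assumes the distribution is registered under the name `edsnlp`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Assumes the distribution name is "edsnlp"; prints None if it is not installed.
print(installed_version("edsnlp"))
```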
Thanks a lot for the super quick fix!