aphp / edsnlp

Modular, fast NLP framework, compatible with PyTorch and spaCy, offering tailored support for French clinical notes.

Home Page: https://aphp.github.io/edsnlp/

Detection of successive empty sections

paul-bssr opened this issue

Description

Using the beta pipeline "eds.sections", I encountered a bug when a section is preceded by an empty section: the second section is not detected, and its content is labeled as belonging to the first (empty) section.

Empty sections are quite common in documents, so this could lead to errors in section labelling of entities.

How to reproduce the bug

For instance, in the following example, the sections "Antécédents :" and "Conclusion :" are not distinguished, so all the content of the "Conclusion" section is tagged as "Antécédents".

import spacy

nlp = spacy.blank("eds")

nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sections")

# Define the patterns used by the matcher
regex = dict(
    # Myolysis
    rhabdomyolyse="rhabdom[yi]ol[yi]se",
    myolyse="m[yi]ol[yi]se",
)

nlp.add_pipe(
    "eds.matcher",
    config=dict(
        regex=regex,
        attr="NORM",
        ignore_excluded=True,
    ),
)

text = """
Antécédents : 
Conclusion : 
Patient va mieux

Au total:
sortie du patient
"""

# Run the pipeline on the text, then inspect the detected sections
doc = nlp(text)

doc.spans["sections"]
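
To make the expected behaviour concrete, here is a minimal pure-Python sketch of the segmentation I would expect for the text above. It is not how "eds.sections" is implemented: the `TITLES` patterns and the `split_sections` helper are hypothetical names made up for this illustration. The point is only that every detected title should open its own section, even when the previous section has no content.

```python
import re

# Hypothetical title patterns, for illustration only.
# eds.sections ships its own (much larger) set of patterns.
TITLES = {
    "antécédents": r"ant[ée]c[ée]dents\s*:",
    "conclusion": r"conclusion\s*:",
    "au total": r"au total\s*:",
}


def split_sections(text):
    """Return (label, body) pairs, one per detected title, in document order."""
    hits = []
    for label, pattern in TITLES.items():
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), label))
    hits.sort()

    sections = []
    for i, (_, title_end, label) in enumerate(hits):
        # A section runs from the end of its title to the start of the next
        # title, so a title immediately followed by another title simply
        # yields an empty body instead of swallowing the next section.
        next_start = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        sections.append((label, text[title_end:next_start].strip()))
    return sections


text = """
Antécédents : 
Conclusion : 
Patient va mieux

Au total:
sortie du patient
"""

sections = split_sections(text)
for label, body in sections:
    print(f"{label!r}: {body!r}")
```

With this sketch, "antécédents" gets an empty body, "conclusion" gets "Patient va mieux", and "au total" gets "sortie du patient", which is the split I would expect from the pipeline.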

Your Environment

  • Operating System:
  • Python Version Used: 3.10.0
  • spaCy Version Used: 3.4.1
  • EDS-NLP Version Used: 0.6.1
  • Environment Information:

Thank you for making us aware of this! I'll look into it.

PR #114 should solve this. Thanks again for the issue!

While waiting for the next release, you can install from the master branch directly with

pip install git+https://github.com/aphp/edsnlp@master

Thanks a lot for the super quick fix!