Speculation detection of concepts in Dutch clinical text
This repository contains the source code for a Dutch speculation detector for clinical text, developed as part of the
ACCUMULATE project. Speculation is detected for clinical concepts identified within a sentence, rather than at the token level.
Requirements
- Python 3
- Frog
Usage
Data preprocessing
This module processes raw clinical text with Frog and aligns the preprocessed output with user-provided concept annotations on the raw text. Gold standard speculation annotations can be included for later evaluation.
from preprocessing import PreprocessCorpus
preprocessor = PreprocessCorpus()
preprocessed_instances = preprocessor(file_ids)
# file_ids = list of paths to .json files containing one dictionary each with the relevant input data
# example input dictionary:
# input_dictionary['text'] = raw clinical text to be processed by Frog
# input_dictionary['concept_spans'] = [{'begin': start_idx, 'end': end_idx},
#                                      {'begin': start_idx, 'end': end_idx}]
# if gold standard annotations are present for speculation:
# input_dictionary['speculation_status'] = [True, False]
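As an illustration of the input format described in the comments above, the following sketch builds one input dictionary and writes it to a .json file. It rests on several assumptions that the comments do not confirm: that 'begin' and 'end' are character offsets into the raw text, that 'end' is exclusive, and that 'speculation_status' holds one value per concept span. The example sentence, the helper span_of and the file name are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Made-up example sentence: "Possibly pneumonia, no fever."
text = "Mogelijk pneumonie, geen koorts."


def span_of(concept):
    # Assumption: spans are character offsets into the raw text, end-exclusive.
    begin = text.index(concept)
    return {"begin": begin, "end": begin + len(concept)}


input_dictionary = {
    "text": text,
    "concept_spans": [span_of("pneumonie"), span_of("koorts")],
    # Optional gold standard, assumed here to hold one value per concept span.
    "speculation_status": [True, False],
}

# Write one dictionary per .json file and collect the paths.
output_dir = Path(tempfile.mkdtemp())
input_path = output_dir / "example_note.json"
input_path.write_text(json.dumps(input_dictionary, ensure_ascii=False), encoding="utf-8")

file_ids = [str(input_path)]
```

The resulting file_ids list has the shape that the preprocessor call above expects, under the stated assumptions.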
Tagging of speculation cues
from speculation_tagger import SpeculationTagger
# set gold_included to True if gold standard annotations are included, else False
gold_included = True
tagger = SpeculationTagger(gold_included)
tagged_sentences = tagger(preprocessed_instances)
Speculation detection of clinical concepts
from speculation_detector import SpeculationDetector, SpeculationDetectorEvaluation
# choose model from ['forward', 'backward', 'forward_punct', 'backward_punct', 'finetuned_baseline', 'finetuned_hybrid']
model = 'forward'  # e.g. the forward model
sentence_instances = tagged_sentences['sentence_instances']
# usage for data WITHOUT gold standard speculation annotations
detector = SpeculationDetector()
instances_detection_data = detector.detect(sentence_instances, model)
# usage for data WITH gold standard speculation annotations
detector = SpeculationDetectorEvaluation()
results = detector(sentence_instances, model)
Forward model
Matches the first concept that follows a detected speculation cue.
Backward model
Matches the first concept that precedes a detected speculation cue.
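The forward and backward matching rules can be sketched on a simplified representation. This is not the repository's implementation: the sketch assumes a sentence is a list of tokens, a cue is a single token index, and each concept is represented by one token index. The function names forward_match and backward_match are hypothetical.

```python
def forward_match(cue_index, concept_positions):
    """Return the position of the first concept after the cue, or None."""
    following = [c for c in concept_positions if c > cue_index]
    return min(following) if following else None


def backward_match(cue_index, concept_positions):
    """Return the position of the first concept before the cue, or None."""
    preceding = [c for c in concept_positions if c < cue_index]
    return max(preceding) if preceding else None


# Made-up example: "fever possibly due-to pneumonia"
tokens = ["koorts", "mogelijk", "door", "pneumonie"]
cue_index = 1               # "mogelijk"
concept_positions = [0, 3]  # "koorts", "pneumonie"

forward_hit = forward_match(cue_index, concept_positions)    # 3, i.e. "pneumonie"
backward_hit = backward_match(cue_index, concept_positions)  # 0, i.e. "koorts"
```

Both functions return None when no concept lies in the searched direction, so a cue without a candidate concept marks nothing as speculative.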
Forward punctuation model
Matches all concepts between a detected speculation cue and the first punctuation mark that follows it.
Backward punctuation model
Matches all concepts between a detected speculation cue and the first punctuation mark that precedes it.
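The two punctuation-bounded rules can be sketched in the same simplified way. The sketch assumes tokenized sentences, a cue given as one token index, concepts given as token indices, and a small hypothetical set of punctuation tokens; which punctuation marks the actual models use is not stated here. If no punctuation is found, the sketch falls back to the sentence boundary, which is also an assumption.

```python
# Assumption: this punctuation set is illustrative only.
PUNCTUATION = {".", ",", ";", ":", "!", "?"}


def forward_punct_match(tokens, cue_index, concept_positions):
    """Return all concepts after the cue and before the next punctuation token."""
    end = len(tokens)  # fall back to the end of the sentence
    for i in range(cue_index + 1, len(tokens)):
        if tokens[i] in PUNCTUATION:
            end = i
            break
    return [c for c in sorted(concept_positions) if cue_index < c < end]


def backward_punct_match(tokens, cue_index, concept_positions):
    """Return all concepts before the cue and after the previous punctuation token."""
    start = -1  # fall back to the start of the sentence
    for i in range(cue_index - 1, -1, -1):
        if tokens[i] in PUNCTUATION:
            start = i
            break
    return [c for c in sorted(concept_positions) if start < c < cue_index]


# Made-up example: "no fever , possibly pneumonia or bronchitis . cough present"
fwd_tokens = ["geen", "koorts", ",", "mogelijk", "pneumonie", "of",
              "bronchitis", ".", "hoest", "aanwezig"]
fwd_hits = forward_punct_match(fwd_tokens, 3, [1, 4, 6, 8])   # [4, 6]

# Made-up example: "cough , pneumonia and bronchitis not ruled-out ."
bwd_tokens = ["hoest", ",", "pneumonie", "en", "bronchitis", "niet",
              "uitgesloten", "."]
bwd_hits = backward_punct_match(bwd_tokens, 5, [0, 2, 4])     # [2, 4]
```

Unlike the plain forward and backward rules, these return a list, since several concepts can fall inside the punctuation-bounded window.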
Fine-tuned baseline model
For each cue separately, applies the most effective of the four baseline models.
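The per-cue selection can be pictured as a lookup table built from development data. The sketch below assumes that effectiveness is measured by a single score per cue and model (for example F1); the metric actually used is not stated here. The scores are made-up numbers for illustration, not results, and select_model_per_cue is a hypothetical name.

```python
BASELINES = ["forward", "backward", "forward_punct", "backward_punct"]


def select_model_per_cue(scores):
    """Map each cue to its best-scoring baseline model.

    `scores` maps a cue to a dict of {model name: score}. Missing models
    count as 0.0, and ties go to the model listed first in BASELINES.
    """
    return {
        cue: max(BASELINES, key=lambda model: per_model.get(model, 0.0))
        for cue, per_model in scores.items()
    }


# Made-up development scores, for illustration only.
dev_scores = {
    "mogelijk": {"forward": 0.80, "backward": 0.35,
                 "forward_punct": 0.72, "backward_punct": 0.30},
    "niet uitgesloten": {"forward": 0.20, "backward": 0.75,
                         "forward_punct": 0.25, "backward_punct": 0.81},
}

best_model = select_model_per_cue(dev_scores)
# {'mogelijk': 'forward', 'niet uitgesloten': 'backward_punct'}
```

At detection time, such a table would decide which of the four matching rules to run for each detected cue.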
Fine-tuned hybrid model
For every cue where it performs better, replaces the fine-tuned baseline model with a rule selected from a set of simple rules over the Frog dependency parse.