PlanTL-GOB-ES / SPACCC_Sentence-Splitter

[PlanTL/medicine/document annotation/NLP preprocessing/sentence splitter] Sentence splitting model created using the Apache OpenNLP machine learning toolkit

The Sentence Splitter (SS) for Clinical Cases Written in Spanish

Digital Object Identifier (DOI) and access to dataset files

https://doi.org/10.5281/zenodo.2586995

Introduction

This repository contains the sentence splitting model trained on the SPACCC_SPLIT corpus (https://github.com/PlanTL-SANIDAD/SPACCC_SPLIT). The model was trained on 90% of the corpus (900 clinical cases) and tested against the remaining 10% (100 clinical cases). It is intended for splitting sentences in biomedical documents, especially clinical cases written in Spanish, and obtains an F-Measure of 98.75%.

This model was created with release 1.8.4 (December 2017) of the Apache OpenNLP machine learning toolkit (https://opennlp.apache.org/).

This repository contains the model, training set, testing set, Gold Standard, executable file, and the source code.

Prerequisites

This software has been compiled with Java SE 1.8 and should also work with more recent versions. You can download Java from the following website: https://www.java.com/en/download

The executable file already bundles the Apache OpenNLP dependencies, so you do not need to download the toolkit separately. However, you may download the latest version from this website: https://opennlp.apache.org/download.html

The library file we have used to compile is "opennlp-tools-1.8.4.jar". The source code should be able to compile with the latest version of OpenNLP, "opennlp-tools-RELEASE_NUMBER.jar". In case there are compilation or execution errors, please let us know and we will make all the necessary updates.

Directory structure

exec/
  An executable file that can be used to apply the sentence splitter to your documents. 
  You can find the notes about its execution below in section "Usage".

gold_standard/
  The clinical cases used as gold standard to evaluate the model's performance.
  
model/
  The sentence splitting model, "es-sentence-splitter-model-spaccc.bin", a binary file.
  
src/
  The source code to create the model (CreateModelSS.java) and evaluate it (EvaluateModelSS.java). 
  The directory includes an example of how to use the model inside your code (SentenceSplitter.java).
  File "abbreviations.dat" contains a list of abbreviations, essential to build the model.

test_set/
  The clinical cases used as test set to evaluate the model's performance.

train_set/
  The clinical cases used to build the model. We use a single file with all documents present in 
  directory "train_set_docs" concatenated.

train_set_docs/
  The clinical cases used to build the model. For each record the sentences are already split.
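
The relation between "train_set_docs" and "train_set" can be sketched as below. This is only an illustration of the concatenation step, not the script the authors used: the class name, the output file name "train_set.txt", the UTF-8 encoding and the one-sentence-per-line layout are all assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: merges the per-document training files into one file. */
final class ConcatenateTrainDocs {

    /** Joins documents in order, ending each one with a line break so that
     *  the last sentence of one file never merges with the first of the next. */
    static String join(List<String> documents) {
        StringBuilder merged = new StringBuilder();
        for (String document : documents) {
            merged.append(document);
            if (!document.isEmpty() && !document.endsWith("\n")) {
                merged.append('\n');
            }
        }
        return merged.toString();
    }

    public static void main(String[] args) throws IOException {
        Path docsDir = Paths.get("train_set_docs");
        Path output = Paths.get("train_set", "train_set.txt"); // assumed file name

        // Sort the file list so the result is the same on every run.
        List<Path> files;
        try (Stream<Path> listing = Files.list(docsDir)) {
            files = listing.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }

        List<String> documents = new ArrayList<>();
        for (Path file : files) {
            documents.add(new String(Files.readAllBytes(file), StandardCharsets.UTF_8)); // assumed encoding
        }

        Files.createDirectories(output.getParent());
        Files.write(output, join(documents).getBytes(StandardCharsets.UTF_8));
    }
}
```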

Usage

The executable file SentenceSplitter.jar splits a document into sentences. It takes two arguments: (1) the text file to split, and (2) the model file (es-sentence-splitter-model-spaccc.bin). The program prints the resulting sentences to the terminal, one sentence per line.

From the exec folder, type the following command in your terminal:

$ java -jar SentenceSplitter.jar INPUT_FILE MODEL_FILE

Examples

Assuming you have the executable file, the input file and the model file in the same directory:

$ java -jar SentenceSplitter.jar file_with_sentences_not_splitted.txt es-sentence-splitter-model-spaccc.bin
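
If you want to collect the output from another Java program instead of reading it in the terminal, the command above can be wrapped in a child process. The following is a minimal sketch, not part of this repository: the class name is hypothetical, and it assumes that the jar writes its sentences to standard output in the platform default encoding and returns a non-zero exit code on failure.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper that runs SentenceSplitter.jar as a child process. */
final class RunSentenceSplitter {

    /** Builds the command line: java -jar JAR INPUT_FILE MODEL_FILE. */
    static List<String> command(String jar, String inputFile, String modelFile) {
        return Arrays.asList("java", "-jar", jar, inputFile, modelFile);
    }

    /** Runs the splitter and returns its standard output, one sentence per element. */
    static List<String> split(String jar, String inputFile, String modelFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(jar, inputFile, modelFile))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show the jar's error messages
                .start();

        List<String> sentences = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), Charset.defaultCharset()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sentences.add(line);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) { // assumption: a non-zero exit code signals failure
            throw new IOException("SentenceSplitter.jar exited with code " + exitCode);
        }
        return sentences;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: RunSentenceSplitter INPUT_FILE MODEL_FILE");
            return;
        }
        for (String sentence : split("SentenceSplitter.jar", args[0], args[1])) {
            System.out.println(sentence);
        }
    }
}
```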

Model creation

To create this sentence splitting model, we used the following training parameters (class TrainingParameters in OpenNLP) to get the best performance:

  • Number of iterations: 4000.
  • Cutoff parameter: 3.
  • Trainer type parameter: EventTrainer.EVENT_VALUE.
  • Algorithm: Maximum Entropy (ModelType.MAXENT.name()).
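
The four values above can also be kept in a properties file, which OpenNLP should be able to load through the TrainingParameters(InputStream) constructor. The sketch below is an illustration and not the code in CreateModelSS.java: the key strings are what the TrainingParameters constants are believed to resolve to in OpenNLP 1.8.x, so check them against your version, and the class name and output file name are hypothetical.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical helper that writes the training parameters to a properties file. */
final class TrainingParamsFile {

    /** The four training parameters listed above, as string key/value pairs. */
    static Properties trainingProperties() {
        Properties properties = new Properties();
        properties.setProperty("Iterations", "4000");   // assumed value of TrainingParameters.ITERATIONS_PARAM
        properties.setProperty("Cutoff", "3");          // assumed value of TrainingParameters.CUTOFF_PARAM
        properties.setProperty("TrainerType", "Event"); // assumed value of EventTrainer.EVENT_VALUE
        properties.setProperty("Algorithm", "MAXENT");  // ModelType.MAXENT.name()
        return properties;
    }

    public static void main(String[] args) throws IOException {
        // The output file name is an arbitrary choice for this example.
        try (Writer out = Files.newBufferedWriter(
                Paths.get("spaccc-training.properties"), StandardCharsets.UTF_8)) {
            trainingProperties().store(out, "Training parameters for the SPACCC sentence splitter");
        }
    }
}
```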

For the sentence detector factory (class SentenceDetectorFactory in OpenNLP), we used the following parameters to get the best performance:

  • Subclass name: null value.
  • Language code: es (for Spanish).
  • Use token end: true.
  • Abbreviation dictionary: file "abbreviations.dat" (included in the src/ directory).
  • End-of-sentence characters: ".", "?" and "!".
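
To give an intuition for the last two parameters: the three characters mark the positions where a sentence may end, and the abbreviation list provides evidence that a period does not close a sentence. The toy sketch below only finds those candidate positions and flags tokens found in an abbreviation set. It is not the Maximum Entropy model and not OpenNLP code; the class name, the sample text and the abbreviation entries are hypothetical and are not taken from "abbreviations.dat".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration of boundary candidates; the real decision is made by the trained model. */
final class BoundaryCandidates {

    static final String EOS_CHARS = ".?!";

    /** Offsets of every ".", "?" or "!" in the text. */
    static List<Integer> candidates(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (EOS_CHARS.indexOf(text.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    /** Whitespace-delimited token that ends at the given offset, for example "Dr." */
    static String tokenEndingAt(String text, int position) {
        int start = position;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        return text.substring(start, position + 1);
    }

    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>(Arrays.asList("Dr.", "Dra.")); // hypothetical entries
        String text = "El Dr. Ruiz indica reposo. El paciente mejora.";
        for (int position : candidates(text)) {
            String token = tokenEndingAt(text, position);
            System.out.println(position + "\t" + token + "\tabbreviation=" + abbreviations.contains(token));
        }
    }
}
```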

Model evaluation

We tuned the model by trying different values for each of the parameters listed above; the values given there produced the best performance. Table 1 shows the resulting evaluation statistics.

Metric                                      Value
Number of sentences in the gold standard    1445
Number of sentences generated               1447
Number of sentences correctly split         1428
Number of sentences wrongly split           12
Number of sentences missed                  5
Precision                                   98.69%
Recall                                      98.82%
F-Measure                                   98.75%

Table 1: Evaluation statistics for the sentence splitting model.
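
The percentages in Table 1 can be recomputed from the sentence counts. The sketch below assumes the standard definitions (precision = correct / generated, recall = correct / gold standard, F-Measure = harmonic mean of the two); the class name is hypothetical and this is not the code in EvaluateModelSS.java. Under those assumptions, precision and recall match the table, and the F-Measure comes out at roughly 98.755%, which agrees with the reported 98.75% up to rounding.

```java
import java.util.Locale;

/** Hypothetical helper that recomputes the scores in Table 1 from the raw counts. */
final class SplitMetrics {

    /** Fraction of generated sentences that are correct. */
    static double precision(int correct, int generated) {
        return (double) correct / generated;
    }

    /** Fraction of gold standard sentences that were recovered. */
    static double recall(int correct, int gold) {
        return (double) correct / gold;
    }

    /** Harmonic mean of precision and recall. */
    static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        int gold = 1445;      // sentences in the gold standard
        int generated = 1447; // sentences produced by the model
        int correct = 1428;   // sentences correctly split

        double p = precision(correct, generated);
        double r = recall(correct, gold);
        double f = fMeasure(p, r);

        System.out.println(String.format(Locale.ROOT, "Precision %.3f%%", 100 * p));
        System.out.println(String.format(Locale.ROOT, "Recall    %.3f%%", 100 * r));
        System.out.println(String.format(Locale.ROOT, "F-Measure %.3f%%", 100 * f));
    }
}
```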

Contact

Ander Intxaurrondo (ander.intxaurrondo@bsc.es)

License

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2018 Secretaría de Estado para el Avance Digital (SEAD)
