spencermountain / compromise

modest natural-language processing

Home Page: http://compromise.cool

Punctuation following abbreviations causes sentences to merge

Fdawgs opened this issue

Node version: 18.18.2
Compromise version: 14.10

As the title states, full stops and other sentence-ending punctuation (?, ! etc.) that occur after an abbreviation cause the following sentence to be treated as part of the original sentence.

Reproduction:

const nlp = require('compromise');

const text = "Dr. Hibbert has advised starting Homer on morphine 400 mg. I have copied this letter to his general practitioner.";
const sentences = nlp(text).sentences().out('array');
console.log(sentences);
/**
 * outputs: 
 * [
 *    'Dr. Hibbert has advised starting Homer on morphine 400 mg. I have copied this letter to his general practitioner.'
 * ]
 */

Comparison without using an abbreviation:

const nlp = require('compromise');

const text = "Dr. Hibbert has advised starting Homer on morphine 400 milligrams. I have copied this letter to his general practitioner.";
const sentences = nlp(text).sentences().out('array');
console.log(sentences);
/**
 * outputs: 
 * [
 *    'Dr. Hibbert has advised starting Homer on morphine 400 milligrams.',
 *    'I have copied this letter to his general practitioner.'
 * ]
 */

hey Frazer, with periods, this is the expected behaviour for abbreviations, like '400 mg. of THC', and 'a sr. in high-school'.
but yeah '12 mg!' and '12 mg?' should end the sentence.
will add this one to the list. Good catch
cheers

fixed in 14.10.1, thanks for the help


@spencermountain this is half-fixed. I think the problem is when an abbreviation is used in text and then followed by a genuine new sentence.

I prescribed him 400mg. He went to the pharmacy.

As I think about this, I guess there's no easy fix for it. We could detect an uppercase next word, but I imagine that will produce a lot of false positives.
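To make the trade-off concrete, here is a rough standalone sketch of that uppercase-next-word idea. It does not use compromise at all: `splitAfterUnits` and the `UNITS` list are hypothetical names made up for illustration, and the regex is just one assumed way to express the heuristic, not how the library handles it.

```javascript
// Hypothetical helper, not part of compromise.
// Assumed sample list of unit abbreviations; a real list would be much longer.
const UNITS = ['mg', 'ml', 'kg'];

// Split on whitespace that follows "<digit><unit>." and precedes an uppercase letter.
function splitAfterUnits(text) {
  const boundary = new RegExp(
    `(?<=\\d\\s?(?:${UNITS.join('|')})\\.)\\s+(?=[A-Z])`
  );
  return text.split(boundary);
}

// Genuine new sentence after the abbreviation: gets split.
console.log(splitAfterUnits('I prescribed him 400mg. He went to the pharmacy.'));

// Lowercase word after the abbreviation: left as one sentence.
console.log(splitAfterUnits('Take 400 mg. of morphine daily.'));

// False positive: a capitalised drug name also triggers a split.
console.log(splitAfterUnits('He was given 400 mg. Tylenol every morning.'));
```

The last call shows the weakness: any proper noun after the abbreviation looks the same as the start of a new sentence.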

FWIW we have a large body of clinical dialogue in text and we rarely see a period after a unit abbreviation. It's not common at all. It's usually written without the period, e.g. "He was injected with 400mg of morphine."