spencermountain / compromise

modest natural-language processing

Home Page: http://compromise.cool

`.not()` is destructive to punctuation

Fdawgs opened this issue

Node version: 20.10.0
Compromise version: 14.10.1

As the title states, punctuation is removed if it is next to the text being removed by .not().
In the examples below, the brackets have been removed in the first two, and the exclamation mark in the last.

Reproduction:

const nlp = require('compromise')

const text = 'The leftorium sells left-handed products (Ned Flanders is the owner)'
const result1 = nlp(text).not('#Person').text()
console.log(result1)
/**
 * outputs:
 * The leftorium sells left-handed products is the owner)
 */

const text2 = 'The leftorium sells left-handed products (the owner is Ned Flanders)'
const result2 = nlp(text2).not('Ned Flanders').text()
console.log(result2)
/**
 * outputs:
 * The leftorium sells left-handed products (the owner is
 */

const text3 = 'The leftorium sells left-handed products, the owner is Ned Flanders!'
const result3 = nlp(text3).not('Ned Flanders').text()
console.log(result3)
/**
 * outputs:
 * The leftorium sells left-handed products, the owner is
 */

Potentially related to #1022.

I noticed this happens too.
It seems to happen when not() omits text that occurs before punctuation. For me, it shows up when I pair not() with parentheses().

// Works as expected
> nlp('this is (kinda) messy').not('messy').parentheses().out('array')
[ '(kinda)' ]

// No results for parentheses()
> nlp('this is (kinda) messy').not('this').parentheses().out('array')
[]

// Multiple terms in parentheses are lost
> nlp('this is (kinda really) messy').not('this').parentheses().out('array')
[ '(kinda' ]

hey, yep that's right - compromise tokenizes punctuation into pre-text and post-text, and has some opinions about which term a punctuation mark should belong to, or whether it should hang on the left or the right of a term.

There's also some guesswork in .text() about whether it should print leading or trailing punctuation - sometimes it should and sometimes it shouldn't, and it decides based on how chopped-up the match is. I think that's what's happening in the Ned Flanders examples.
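
To make the pre-text / post-text idea concrete, here is a toy sketch in plain JavaScript. It is not compromise's tokenizer or its .not() implementation: `tokenize`, `dropTerms`, and `render` are hypothetical helpers, and the only assumption is that each term stores the punctuation touching it, so discarding a term discards that punctuation too.

```javascript
// Toy model only: NOT compromise's real tokenizer or its .not() implementation.
// Assumption: every term owns the punctuation directly before (pre) and after (post) it.
function tokenize(str) {
  return str.split(/\s+/).filter(Boolean).map((word) => {
    // leading non-alphanumerics, the word itself, trailing non-alphanumerics (ASCII only)
    const m = word.match(/^([^A-Za-z0-9]*)(.*?)([^A-Za-z0-9]*)$/)
    return { pre: m[1], text: m[2], post: m[3] }
  })
}

// Hypothetical helper: keep only the terms whose text is not in `unwanted`.
// The pre/post punctuation of a dropped term is dropped along with it.
function dropTerms(terms, unwanted) {
  const drop = new Set(unwanted.map((w) => w.toLowerCase()))
  return terms.filter((t) => !drop.has(t.text.toLowerCase()))
}

// Hypothetical helper: print each surviving term with its own punctuation.
function render(terms) {
  return terms.map((t) => t.pre + t.text + t.post).join(' ')
}

const terms = tokenize('The leftorium sells left-handed products (Ned Flanders is the owner)')
const withoutPerson = render(dropTerms(terms, ['Ned', 'Flanders']))
console.log(withoutPerson)
// The leftorium sells left-handed products is the owner)
```

In this model the opening bracket lives on the 'Ned' term, so it disappears with it while the closing bracket on 'owner' survives, which is the same shape as the first reported output.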

That being said, this does appear to be a bug:

nlp('this is (kinda) messy').not('this').parentheses()

I'll take a look at all of these, if I can, today.
cheers

hey @Fdawgs - you may want to try .remove(), which mutates the document, instead of .not(), which just changes the current match. It's a subtle difference, but .remove() will do some of the things you're after, such as repairing sentence punctuation:

const text3 = 'The leftorium sells left-handed products, the owner is Ned Flanders!'
const result3 = nlp(text3).remove('Ned Flanders').text()
console.log(result3)
//The leftorium sells left-handed products, the owner is!

Please let me know if you spot .remove() mangling punctuation in unexpected ways. I don't think it's been tested very well. It would be fun to improve.
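
As a rough illustration of what "repairing" punctuation could mean, the sketch below deletes terms but hands any trailing punctuation to the nearest surviving term on the left. This is a toy model in plain JavaScript, not how compromise implements .remove(); `splitTerms` and `removeWithRepair` are hypothetical names, and the hand-off rule is an assumption chosen only to show the idea.

```javascript
// Toy model only: this is NOT how compromise implements .remove().
// Assumption: each term is { pre, text, post }, with punctuation stored on the term it touches.
function splitTerms(str) {
  return str.split(/\s+/).filter(Boolean).map((word) => {
    const m = word.match(/^([^A-Za-z0-9]*)(.*?)([^A-Za-z0-9]*)$/)
    return { pre: m[1], text: m[2], post: m[3] }
  })
}

// Hypothetical helper: delete the unwanted terms, but move the trailing
// punctuation of a deleted term onto the previous surviving term.
function removeWithRepair(terms, unwanted) {
  const drop = new Set(unwanted.map((w) => w.toLowerCase()))
  const kept = []
  for (const term of terms) {
    if (!drop.has(term.text.toLowerCase())) {
      kept.push({ ...term })
    } else if (term.post && kept.length > 0) {
      kept[kept.length - 1].post += term.post
    }
  }
  return kept.map((t) => t.pre + t.text + t.post).join(' ')
}

const repaired = removeWithRepair(
  splitTerms('The leftorium sells left-handed products, the owner is Ned Flanders!'),
  ['Ned', 'Flanders']
)
console.log(repaired)
// The leftorium sells left-handed products, the owner is!
```

The sketch only handles trailing punctuation; leading punctuation such as an opening bracket would still be lost, so a fuller repair would need a matching rule for `pre`.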

@track0x1 the fix is on dev and will be part of the next release, likely this week.
cheers

fixed in 14.12.0

nlp('this is (kinda) messy').not('this').parentheses() //'(kinda)'

cheers