`.not()` is destructive to punctuation
Fdawgs opened this issue
Node version: 20.10.0
Compromise version: 14.10.1
As the title states, punctuation is removed if it is next to the text being removed by .not().
In the examples below, a bracket has been removed in each of the first two, and the exclamation mark in the last.
Reproduction:
const nlp = require('compromise')
const text = 'The leftorium sells left-handed products (Ned Flanders is the owner)'
const result1 = nlp(text).not('#Person').text()
console.log(result1)
/**
* outputs:
* The leftorium sells left-handed products is the owner)
*/
const text2 = 'The leftorium sells left-handed products (the owner is Ned Flanders)'
const result2 = nlp(text2).not('Ned Flanders').text()
console.log(result2)
/**
* outputs:
* The leftorium sells left-handed products (the owner is
*/
const text3 = 'The leftorium sells left-handed products, the owner is Ned Flanders!'
const result3 = nlp(text3).not('Ned Flanders').text()
console.log(result3)
/**
* outputs:
* The leftorium sells left-handed products, the owner is
*/
Potentially related to #1022.
I noticed this happens too.
It seems to happen when not() is omitting text that occurs before punctuation. For me this has manifested when I pair not() with parentheses().
// Works as expected
> nlp('this is (kinda) messy').not('messy').parentheses().out('array')
[ '(kinda)' ]
// No results for parentheses()
> nlp('this is (kinda) messy').not('this').parentheses().out('array')
[]
// Multiple terms in parentheses are lost
> nlp('this is (kinda really) messy').not('this').parentheses().out('array')
[ '(kinda' ]
hey, yep that's right - compromise is tokenizing punctuation into pre-text and post-text, and has some opinions on which term a punctuation mark should be on, or if it should hang on the left or the right of a term.
There's also the guesswork in .text() about whether it should print leading or trailing punctuation - sometimes it should and sometimes it shouldn't, and it decides based on how chopped-up the match is. I think that's what's happening in the Ned Flanders examples.
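To make the pre-text/post-text idea concrete, here is a minimal sketch of a toy tokenizer that hangs punctuation on individual terms. It is an illustration only, not compromise's actual tokenizer: `tokenize`, `render`, and the regular expression are hypothetical names and rules chosen for this example. It shows why dropping a term from a match can take its attached bracket with it.

```javascript
// Toy model only: NOT compromise's real tokenizer.
// Each term carries its leading punctuation (pre) and its trailing
// punctuation plus whitespace (post).
function tokenize(text) {
  const terms = []
  // group 1: leading punctuation, group 2: the word, group 3: trailing punctuation and spaces
  const re = /([^\w\s]*)([\w'-]+)([^\w\s]*\s*)/g
  let m
  while ((m = re.exec(text)) !== null) {
    terms.push({ pre: m[1], text: m[2], post: m[3] })
  }
  return terms
}

// Print a list of terms by gluing pre + text + post back together.
function render(terms) {
  return terms.map((t) => t.pre + t.text + t.post).join('').trim()
}

const terms = tokenize('products (Ned Flanders is the owner)')
console.log(terms[1]) // { pre: '(', text: 'Ned', post: ' ' }

// Dropping 'Ned' also drops the '(' that was stored on it.
const kept = terms.filter((t) => !['Ned', 'Flanders'].includes(t.text))
console.log(render(kept)) // products is the owner)
```

In this toy model the opening bracket lives on 'Ned', so any selection that excludes that term has nowhere else to keep it unless the printing step repairs it.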
That being said, this does appear to be a bug:
nlp('this is (kinda) messy').not('this').parentheses()
I'll take a look at all of these, if I can, today.
cheers
hey @Fdawgs - you may want to try .remove(), which mutates the document, instead of .not(), which just changes the current match. It's a subtle difference, but .remove() will do some of the things you seek, regarding repairing sentence-punctuation, and things:
const text3 = 'The leftorium sells left-handed products, the owner is Ned Flanders!'
const result3 = nlp(text3).remove('Ned Flanders').text()
console.log(result3)
//The leftorium sells left-handed products, the owner is!
Please let me know if you spot .remove() mangling punctuation in unexpected ways. I don't think it's been tested very well. It would be fun to improve.
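The difference between narrowing a match and removing with punctuation repair can be sketched on a toy term list. This is a rough illustration under stated assumptions, not how compromise implements either method: `notView` and `removeAndRepair` are hypothetical helpers, and the repair rule (hand a removed term's trailing punctuation to the previous surviving term) is an assumption made for the example. For simplicity the second helper returns a repaired copy rather than mutating in place.

```javascript
// Toy model only: hypothetical helpers, not compromise internals.
const terms = [
  { pre: '', text: 'the', post: ' ' },
  { pre: '', text: 'owner', post: ' ' },
  { pre: '', text: 'is', post: ' ' },
  { pre: '', text: 'Ned', post: ' ' },
  { pre: '', text: 'Flanders', post: '!' },
]

const render = (list) => list.map((t) => t.pre + t.text + t.post).join('').trim()

// not()-style: only narrow the selection; surviving terms are left untouched.
function notView(list, drop) {
  return list.filter((t) => !drop.includes(t.text))
}

// remove()-style: drop terms, then hand any trailing punctuation of a
// removed term to the previous surviving term (assumed repair rule).
function removeAndRepair(list, drop) {
  const out = []
  for (const t of list) {
    if (!drop.includes(t.text)) {
      out.push({ ...t })
      continue
    }
    const prev = out[out.length - 1]
    if (prev && t.post.trim() !== '') {
      prev.post = prev.post.trimEnd() + t.post
    }
  }
  return out
}

console.log(render(notView(terms, ['Ned', 'Flanders']))) // the owner is
console.log(render(removeAndRepair(terms, ['Ned', 'Flanders']))) // the owner is!
```

The second result has the same shape as the .remove() output quoted above, where the exclamation mark ends up attached to the last surviving word.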
@track0x1 fix is on dev, will be part of next release, likely this week.
cheers
fixed in 14.12.0
nlp('this is (kinda) messy').not('this').parentheses() //'(kinda)'
cheers