spencermountain / compromise

modest natural-language processing

Home Page:http://compromise.cool

Text containing quotes or parentheses sometimes isn't split into sentences correctly.

julianpeterson1 opened this issue

There are some cases where the sentence parser splits text incorrectly when it contains quotations or parentheses:

Example 1:

Descartes famously said, "I think therefore I am." I think Descartes is wrong.

Should return an array of two sentences:

  1. Descartes famously said, "I think therefore I am."
  2. I think Descartes is wrong.

Instead, it returns just a single sentence. (The same thing happens with either inline quotes or parentheses.)

Example 2:

When multiple sentences sit inside a set of parentheses or an inline quote, the sentence parser doesn't return the correct result:

Descartes famously said cool things (well, he didn't say super cool things actually. But whatever.) I believe Descartes is wrong.

Should return an array of two sentences:
  1. Descartes famously said cool things (well, he didn't say super cool things actually. But whatever.)
  2. I believe Descartes is wrong.

Instead, it returns the whole text as a single sentence.

Thanks! Awesome library.

Hey Julian - apologies for the delay, I've been off-keyboard for a week or two.

yea - I understand the frustration, I've gone back and forth on this a few times. If you have strong feelings about one style, I could be persuaded.

My concern was things like: Descartes famously said "Yo!" and I agree. I didn't want to tokenize "Descartes famously said" as a full sentence. Maybe there's a good way to classify scare-quotes vs block-quotes, e.g. whether the quote has a subject-verb-object? I dunno.

You can see the current logic here to determine if a sentence is within a quotation - it simply uses a character-count. A PR is welcome if there's a proper definition from Oxford or something. Maybe some other tokenizers have clearer opinions.

You're also welcome to swap out the sentence splitter for a custom one completely - I had to do that for the Japanese compromise, and can help you if you'd prefer this.
Cheers

Hey Spencer,

I think the rule is that full-stop punctuation at the end of a quotation or a set of parentheses should be treated as the end of the whole sentence, unless it is followed by a coordinating conjunction such as and, but, or or.

For example:

Descartes famously said "Yo!" and I agree. -- One sentence.
Descartes famously said "Yo!" but I agree. -- One sentence.
Descartes famously said "Yo!" I agree. -- Two sentences.

Without that conjunction, the splitting should consider the full stop punctuation to signify the end of the sentence.
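To make the rule concrete, here is a rough sketch in plain JavaScript. It is not compromise's code and does not use its API: `splitSentences`, `nextWord`, and the `CONJUNCTIONS` list are hypothetical names made up for illustration. It assumes straight double quotes, and it doesn't handle abbreviations like "Dr." or single-quoted speech.

```javascript
// Hypothetical sketch of the proposed rule; not part of compromise.
// Assumed list of coordinating conjunctions (the "FANBOYS" set).
const CONJUNCTIONS = new Set(['and', 'but', 'or', 'nor', 'for', 'yet', 'so']);

const isTerminal = (c) => c === '.' || c === '!' || c === '?';

// Returns the first word after position `from` (lowercased), or null.
function nextWord(text, from) {
  const m = /^\s+([A-Za-z']+)/.exec(text.slice(from));
  return m ? m[1].toLowerCase() : null;
}

function splitSentences(text) {
  const sentences = [];
  let start = 0;
  let depth = 0;       // nesting depth of parentheses
  let inQuote = false; // inside straight double quotes

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    let closedGroup = false;
    if (ch === '(') {
      depth++;
    } else if (ch === ')' && depth > 0) {
      depth--;
      closedGroup = true;
    } else if (ch === '"') {
      inQuote = !inQuote;
      closedGroup = !inQuote;
    }

    const topLevel = depth === 0 && !inQuote;
    const atGap = i + 1 === text.length || /\s/.test(text[i + 1]);
    if (!topLevel || !atGap) continue;

    let boundary = false;
    if (isTerminal(ch)) {
      // Ordinary full stop outside any quote or parenthesis.
      boundary = true;
    } else if (closedGroup) {
      // Look back past stacked closers, e.g. `."` or `.")`.
      let k = i - 1;
      while (k >= 0 && (text[k] === '"' || text[k] === ')')) k--;
      if (k >= 0 && isTerminal(text[k])) {
        // The quote/parenthesis ended with a full stop: split here
        // unless a coordinating conjunction follows.
        const word = nextWord(text, i + 1);
        boundary = word === null || !CONJUNCTIONS.has(word);
      }
    }

    if (boundary) {
      sentences.push(text.slice(start, i + 1).trim());
      start = i + 1;
    }
  }

  const rest = text.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}
```

With this sketch, `splitSentences('Descartes famously said "Yo!" and I agree.')` gives back one sentence, and `splitSentences('Descartes famously said "Yo!" I agree.')` gives back two. One known weakness: a new sentence that happens to start with "So" or "Yet" right after a quote would be merged into the previous one, so the conjunction check probably needs refining.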

Let me know what you think,

Julian