Text containing quotes or parentheses sometimes isn't split into sentences correctly.
julianpeterson1 opened this issue
There are some cases where the sentence parser doesn't split text correctly when it contains quotations or parentheses:
Example 1:
Descartes famously said, "I think therefore I am." I think Descartes is wrong.
Should return an array of two sentences:
- Descartes famously said, "I think therefore I am."
- I think Descartes is wrong.
Instead, it returns just a single sentence. (This happens with both inline quotes and parentheses.)
Example 2:
When multiple sentences appear inside a set of parentheses or an inline quote, the sentence parser doesn't return the correct result:
Descartes famously said cool things (well, he didn't say super cool things actually. But whatever.) I believe Descartes is wrong.
Should return an array of two sentences:
- Descartes famously said cool things (well, he didn't say super cool things actually. But whatever.)
- I believe Descartes is wrong.
Instead, it returns the whole text as a single sentence.
Thanks! Awesome library.
Hey Julian - apologies for the delay, I've been off-keyboard for a week or two.
Yeah - I understand the frustration, I've gone back and forth on this a few times. If you have strong feelings about one style, I could be persuaded.
My concern was things like Descartes famously said "Yo!" and I agree.
- I didn't want to tokenize "descartes famously said" as a full sentence. Maybe there's a good way to classify scare-quotes vs block-quotes - if it has a subj-verb-obj? I dunno.
You can see the current logic here for determining whether a sentence is within a quotation - it simply uses a character count. A PR is welcome if there's a proper definition from Oxford or something. Maybe some other tokenizers have clearer opinions.
You're also welcome to swap in a custom sentence splitter completely - I had to do it for the Japanese compromise and can help you if you prefer this.
cheers
Hey Spencer,
I think the rule is that full-stop punctuation at the end of a quotation or a set of parentheses should be considered the end of the whole sentence unless it is followed by a coordinating conjunction (and, but, or, etc.).
For example:
Descartes famously said "Yo!" and I agree. -- One sentence.
Descartes famously said "Yo!" but I agree. -- One sentence.
Descartes famously said "Yo!" I agree. -- Two sentences.
Without that conjunction, the splitter should treat the full-stop punctuation as the end of the sentence.
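Here is a minimal sketch of that rule in TypeScript, just to make the idea concrete. It is not the library's code: `splitSentences` and `CONJUNCTIONS` are hypothetical names, the conjunction list is my assumption (the usual seven coordinating conjunctions), and it only handles straight double quotes and round parentheses. It also ignores abbreviations like "Dr.", which a real splitter would need to handle.

```typescript
// Assumed list of coordinating conjunctions; adjust as needed.
const CONJUNCTIONS = new Set(["and", "but", "or", "nor", "for", "so", "yet"]);

// Hypothetical helper, not part of any library.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  let depth = 0;       // nesting depth of round parentheses
  let inQuote = false; // currently inside straight double quotes

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "(") depth++;
    else if (ch === ")") depth = Math.max(0, depth - 1);
    else if (ch === '"') inQuote = !inQuote;
    if (ch !== "." && ch !== "!" && ch !== "?") continue;

    // Peek past any closing quote/parenthesis that directly follows the punctuation.
    let j = i + 1;
    let q = inQuote;
    let d = depth;
    while (j < text.length) {
      if (text[j] === '"' && q) { q = false; j++; }
      else if (text[j] === ")" && d > 0) { d--; j++; }
      else break;
    }

    // Still inside a quotation or parentheses: not a sentence boundary.
    if (q || d > 0) continue;
    // A boundary must be followed by whitespace or the end of the text.
    if (j < text.length && !/\s/.test(text[j])) continue;

    // If a quote/paren just closed and a conjunction follows, keep going.
    const closedSomething = j > i + 1;
    const m = /^\s*([A-Za-z]+)/.exec(text.slice(j));
    const nextWord = m ? m[1].toLowerCase() : "";
    if (closedSomething && CONJUNCTIONS.has(nextWord)) continue;

    sentences.push(text.slice(start, j).trim());
    start = j;
    inQuote = q;
    depth = d;
    i = j - 1; // skip the closers we already consumed
  }

  // Keep any trailing text that has no terminal punctuation.
  const rest = text.slice(start).trim();
  if (rest.length > 0) sentences.push(rest);
  return sentences;
}

console.log(splitSentences('Descartes famously said "Yo!" and I agree.').length); // 1
console.log(splitSentences('Descartes famously said "Yo!" I agree.'));
// [ 'Descartes famously said "Yo!"', 'I agree.' ]
```

With this sketch, punctuation inside an unclosed quote or parenthesis never splits, so both examples from the original report come back as two sentences.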
Let me know what you think,
Julian