Unexpected behavior if "no space after period"
alex-breen opened this issue · comments
When I run doc.compute('root'), doc.terms, or doc.splitOn(), I've noticed that if two words are separated by a period without a space after the period, the word does not split.
E.g. "Let's go.Then return" results in ["let's", "go.Then", "return"]
Is this deliberate so that URLs (with periods) aren't split up? Or is it a bug?
Possibly related, the splitOn examples from https://observablehq.com/@spencermountain/compromise-split don't return the results I'd expect for comma, period, and space.
Thanks!
@alex-breen - not saying that this isn't a bug or shouldn't be addressed. As far as I know, these should be split, with the periods and commas removed from the words.
But I'm not sure whether you need to call another function to do this. As far as I recall (I'm not at a computer at the moment), if you look in the terms object you should be able to see the normalized text there.
Hey alex - sorry for the delay. Yep - we disambiguate periods for a number of cases, and the whitespace (or end of line) after a period is pretty important to the sentence splitter. Compromise assumes all input text is correct, and IMO this looks like a typo to correct before analysis.
You can shim the sentence tokenizer with a custom method, or clean it up beforehand.
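For the "clean it up beforehand" route, here is a minimal sketch of one possible pre-processing step. The helper name fixMissingSpaces is hypothetical and not part of compromise, and the rule it applies (lowercase letter, period, uppercase letter) is only a heuristic chosen to leave lowercase URLs like example.com and acronyms like U.S.A. alone. It will still mis-split things like mixed-case hostnames, so treat it as a starting point.

```javascript
// Hypothetical helper, not part of compromise: inserts a space after a period
// that sits between a lowercase letter and an uppercase letter.
// Assumption: such a pattern is a typo ("go.Then"), not a URL or an acronym.
function fixMissingSpaces(text) {
  return text.replace(/([a-z])\.([A-Z])/g, '$1. $2');
}

const cleaned = fixMissingSpaces("Let's go.Then return");
console.log(cleaned);
```

The cleaned string can then be passed to nlp() as usual, so the sentence splitter sees the whitespace it expects.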
.split() is a different method - it is used for loops in the API, and won't change the tokenization. Sorry for the confusion.
Cheers
Thanks for clarifying! Compromise is awesome.