Unexpected behavior if "no space after period"

Question

alex-breen opened this issue 8 months ago

When I run doc.compute('root'), doc.terms(), or doc.splitOn(), I've noticed that two words separated by a period with no space after it are not split.

E.g. "Let's go.Then return" results in ["let's", "go.Then", "return"]

Is this deliberate so that URLs (with periods) aren't split up? Or is it a bug?

Possibly related, the splitOn examples from https://observablehq.com/@spencermountain/compromise-split don't return the results I'd expect for comma, period, and space.

Thanks!

Jared Van Valkengoed · Answer 1 · Fri Sep 22 2023 02:14:35 GMT+0800 (China Standard Time)

@alex-breen - I'm not saying that this isn't a bug or that it shouldn't be addressed. As far as I know, these should be split, with the periods and commas removed from the words.

But I'm not sure whether you need to call another function to do this. As far as I recall (I'm not at a computer at the moment), if you look in the terms object you should be able to see the normalized text there.

spencer kelly · Answer 2 · Fri Sep 22 2023 06:50:53 GMT+0800 (China Standard Time)

Hey alex - sorry for the delay. Yep - we disambiguate periods for a number of cases, and the whitespace (or end of line) after a period is pretty important to the sentence splitter. Compromise assumes all input text is correct, and IMO this looks like a typo to correct before analysis.
You can shim the sentence tokenizer with a custom method, or clean up the text beforehand.
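
For the "clean it up beforehand" route, here is a minimal sketch in plain JavaScript. The helper name addMissingSpaces is hypothetical and not part of compromise, and the rule it applies (lowercase letter, then punctuation, then an uppercase letter) is an assumption about what counts as a typo. It may misfire on names like "Node.JS", so adjust it to your data.

```javascript
// Hypothetical helper (not part of compromise): insert the missing space
// after sentence-ending punctuation that is glued to the next word.
function addMissingSpaces(text) {
  // Match '.', '!' or '?' that follows a lowercase letter and is
  // immediately followed by an uppercase letter, e.g. "go.Then".
  // Requiring the lowercase-to-uppercase change leaves strings such as
  // "example.com", "3.14" and "e.g." untouched.
  return text.replace(/([a-z])([.!?])(?=[A-Z])/g, '$1$2 ');
}

const cleaned = addMissingSpaces("Let's go.Then return");
console.log(cleaned);
```

The cleaned string can then be passed to nlp() as usual, so the sentence splitter sees the whitespace it expects.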

.split() is a different method - it is used for loops in the API, and won't change the tokenization. Sorry for the confusion.
Cheers

alex-breen · Answer 3 · Fri Sep 22 2023 07:47:24 GMT+0800 (China Standard Time)

Thanks for clarifying! Compromise is awesome.