spencermountain / compromise

modest natural-language processing

Home Page: http://compromise.cool

Unexpected behavior if "no space after period"

alex-breen opened this issue

When I run doc.compute('root') or doc.terms or doc.splitOn(), I've noticed that if two words are separated by a period with no space after it, the words are not split.

E.g. "Let's go.Then return" results in ["let's", "go.Then", "return"]

Is this deliberate so that URLs (with periods) aren't split up? Or is it a bug?

Possibly related, the splitOn examples from https://observablehq.com/@spencermountain/compromise-split don't return the results I'd expect for comma, period, and space.

Thanks!

@alex-breen - not saying that this isn't a bug / shouldn't be addressed. As far as I know, these should be split, with the periods / commas removed from the words.

But I'm not sure whether you should be calling another function to do this. As far as I recall (I'm not at a computer at the moment), if you look in the terms object you should be able to see the normalized text there.

Hey alex - sorry for the delay. Yep - we disambiguate periods for a number of cases, and the whitespace (or end of line) after the period is pretty important to the sentence splitter. Compromise assumes all input text is correct, and IMO this looks like a typo to correct before analysis.
You can shim the sentence tokenizer with a custom method, or clean the text up beforehand.

.split() is a different method - it is used for loops in the API, and won't change the tokenization. Sorry for the confusion.
Cheers
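
For anyone landing here later, the "clean it up beforehand" route could look roughly like the sketch below. It is only an illustration: `addSpaceAfterPeriod` is a hypothetical helper, not part of compromise, and the rule it uses (add a space only when the period sits between a lowercase letter and an uppercase letter) is an assumed heuristic, chosen so that lowercase domains such as compromise.cool are left alone. It will still misfire on things like mixed-case hostnames.

```javascript
// Hypothetical pre-processing helper (not part of compromise).
// Assumption: a period between a lowercase letter and an uppercase letter
// is a sentence boundary that is missing its trailing space.
function addSpaceAfterPeriod(text) {
  return text.replace(/([a-z])\.([A-Z])/g, '$1. $2');
}

const cleaned = addSpaceAfterPeriod("Let's go.Then return");
console.log(cleaned); // Let's go. Then return

// Lowercase after the period (e.g. a domain name) is left untouched.
console.log(addSpaceAfterPeriod('see compromise.cool')); // see compromise.cool
```

The cleaned string can then be passed to nlp() as usual, so the sentence splitter sees the whitespace it relies on.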

Thanks for clarifying! Compromise is awesome.