wikimedia / sentencex

A sentence segmentation library with wide language support optimized for speed and utility.

Home Page: https://wikimedia.github.io/sentencex/

terminators.py contains codepoints which are not sentence terminators

divec opened this issue · comments

One of these is ሻ (U+123B Ethiopic Syllable Shaa), which appears in test_am.py:

"ቴዎድሮስ ጥር ፮ ቀን ፲፰፻፲፩ ዓ.ም. ሻርጌ በተባለ ቦታ ቋራ ውስጥ፣ ከጎንደር ከተማ በስተ ምዕራብ ተወለዱ።",
["ቴዎድሮስ ጥር ፮ ቀን ፲፰፻፲፩ ዓ.ም. ሻ", "ርጌ በተባለ ቦታ ቋራ ውስጥ፣ ከጎንደር ከተማ በስተ ምዕራብ ተወለዱ።"],

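The expected value should be a single sentence, so the test only passes today because it encodes the incorrect split after ሻ; correcting the expectation would make the test fail until the codepoint is removed from the list. There appear to be several other letter codepoints in the list too.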
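One rough way to find the other letters is to filter the list by Unicode general category, since a sentence terminator should be punctuation rather than a letter. The sketch below is only an illustration: `find_letter_codepoints` and the `sample` list are hypothetical stand-ins, not the actual contents or API of terminators.py.

```python
import unicodedata


def find_letter_codepoints(chars):
    """Return (char, codepoint, name) for entries whose general category is a letter (L*)."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in chars
        if unicodedata.category(ch).startswith("L")
    ]


# Hypothetical sample, not the real list from terminators.py.
sample = [
    "\u1362",      # ETHIOPIC FULL STOP (punctuation)
    "\u123B",      # ETHIOPIC SYLLABLE SHAA (letter, should be flagged)
    ".",           # FULL STOP (punctuation)
    "\U0001DA88",  # SIGNWRITING FULL STOP (punctuation)
]

suspects = find_letter_codepoints(sample)
for ch, codepoint, name in suspects:
    print(codepoint, name)
```

A category check like this only catches letters; it does not prove that the remaining punctuation codepoints are all genuine sentence terminators.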

Also, some of the codepoints have incorrect comments. For example, the last item is U+1DA88 (SignWriting Full Stop), which correctly belongs in the list, but its comment inaccurately describes it as "Mathematical Bold Capital U".
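The comments could be checked mechanically against the Unicode character names. This is a minimal sketch, assuming the list can be expressed as (character, comment) pairs; `mismatched_comments` and the `entries` sample are hypothetical and do not reflect how terminators.py is actually laid out.

```python
import unicodedata


def mismatched_comments(entries):
    """Return (codepoint, comment, official_name) where the comment differs from the Unicode name."""
    mismatches = []
    for ch, comment in entries:
        official = unicodedata.name(ch, "")
        # Compare case-insensitively, since comments may use title case.
        if official.lower() != comment.strip().lower():
            mismatches.append((f"U+{ord(ch):04X}", comment, official))
    return mismatches


# Hypothetical entries modelled on the example in this issue.
entries = [
    ("\u1362", "Ethiopic Full Stop"),
    ("\U0001DA88", "Mathematical Bold Capital U"),
]

bad = mismatched_comments(entries)
for codepoint, comment, official in bad:
    print(f"{codepoint}: comment says {comment!r}, Unicode name is {official!r}")
```

The names come from the Unicode database bundled with the running Python version, so very recently added characters may show up with an empty name on older interpreters.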