terminators.py contains codepoints which are not sentence terminators
divec opened this issue
One of these is ሻ (U+123B Ethiopic Syllable Shaa), which appears in test_am.py:
"ቴዎድሮስ ጥር ፮ ቀን ፲፰፻፲፩ ዓ.ም. ሻርጌ በተባለ ቦታ ቋራ ውስጥ፣ ከጎንደር ከተማ በስተ ምዕራብ ተወለዱ።",
["ቴዎድሮስ ጥር ፮ ቀን ፲፰፻፲፩ ዓ.ም. ሻ", "ርጌ በተባለ ቦታ ቋራ ውስጥ፣ ከጎንደር ከተማ በስተ ምዕራብ ተወለዱ።"],
The expected value should be a single sentence, so correcting the test would make it fail with the current list. Several other letter codepoints appear to be in the list too.
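One way to find the remaining letter codepoints is to check each entry's Unicode general category, since letters have a category starting with "L" while sentence terminators are normally punctuation. This is a minimal sketch using the standard `unicodedata` module; `find_letter_codepoints` and `SAMPLE` are hypothetical names, and the sample values are illustrative rather than the real contents of terminators.py.

```python
import unicodedata


def find_letter_codepoints(codepoints):
    """Return (codepoint, name, category) for entries that are letters."""
    flagged = []
    for cp in codepoints:
        ch = chr(cp)
        category = unicodedata.category(ch)
        # General categories Lu, Ll, Lt, Lm and Lo all denote letters.
        if category.startswith("L"):
            flagged.append((cp, unicodedata.name(ch, "<unnamed>"), category))
    return flagged


# Hypothetical sample, NOT the actual list from terminators.py.
SAMPLE = [0x002E, 0x1362, 0x123B, 0x1DA88]

for cp, name, category in find_letter_codepoints(SAMPLE):
    print(f"U+{cp:04X} {name} ({category})")
```

A category check like this only flags candidates. Each flagged entry would still need a manual look before being removed.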
Some of the codepoints also have incorrect comments. For example, the last item, U+1DA88 (SignWriting Full Stop), correctly belongs in the list, but its comment describes it as "Mathematical Bold Capital U".
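The comments could be checked mechanically by comparing each one with the official Unicode character name. The sketch below assumes the entries can be read as a mapping from codepoint to comment text. That mapping, `find_wrong_comments`, and `ANNOTATED` are hypothetical and do not reflect how terminators.py is actually laid out.

```python
import unicodedata


def find_wrong_comments(annotated):
    """Return (codepoint, comment, actual_name) where the comment is not the Unicode name."""
    mismatches = []
    for cp, comment in annotated.items():
        actual = unicodedata.name(chr(cp), "")
        # Compare case-insensitively, since comments may use title case.
        if actual.casefold() != comment.strip().casefold():
            mismatches.append((cp, comment, actual))
    return mismatches


# Hypothetical data for illustration only.
ANNOTATED = {
    0x1362: "Ethiopic Full Stop",
    0x1DA88: "Mathematical Bold Capital U",
}

for cp, comment, actual in find_wrong_comments(ANNOTATED):
    print(f"U+{cp:04X}: comment says {comment!r}, Unicode name is {actual!r}")
```

The names come from the Unicode database bundled with the Python interpreter, so codepoints newer than that database would be reported with an empty name.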