Hyphenation minimal word length and casing are not UTF8-compliant
Omikhleia opened this issue
Issue
Relates to #2017 regarding the hard-coded minWord = 5 value, but this is a different kind of issue.
The logic is not UTF-8 compliant:
sile/core/hyphenator-liang.lua, lines 58 to 63 in b2cc084
- Use of string.len is not UTF-8 safe, so the minWord value is likely not honored as it ought to be.
  - It seems to correspond to LuaTeX's hyphenationmin.
  - It could be argued that it is counted in bytes here, but then it is not in line with leftmin/rightmin (which are counted in characters)...
- Use of text:lower() is not UTF-8 safe.
  - Likewise regarding the string.lower call a bit later in a SU.map (... and aren't we performing the lowercase operation again? somehow acceptable)
Proofs / Minimal examples
The second case here, with minWord at 6, would be expected not to hyphenate "léris":
> SILE.showHyphenationPoints("léris", "fr")
lé-ris
> SILE._hyphenators["fr"].minWord
5
> SILE._hyphenators["fr"].minWord = 6
> SILE.showHyphenationPoints("léris", "fr")
lé-ris
> -- OOPS. "léris" is 5 characters long (but 6 bytes long)
> SILE._hyphenators["fr"].minWord = 7
> SILE.showHyphenationPoints("léris", "fr")
léris
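To illustrate the mismatch, here is a minimal sketch in plain Lua of a length check that counts characters rather than bytes. The names utf8_len and long_enough are hypothetical and are not taken from the SILE code base; the sketch assumes valid UTF-8 input and does not assume any particular UTF-8 library is available.

```lua
-- Count UTF-8 characters by skipping continuation bytes (0x80 to 0xBF).
-- Assumption: the input is valid UTF-8.
local function utf8_len(s)
  local count = 0
  for i = 1, #s do
    local b = s:byte(i)
    if b < 0x80 or b >= 0xC0 then
      count = count + 1
    end
  end
  return count
end

-- Hypothetical helper: is the word long enough to be hyphenated?
local function long_enough(word, min_word)
  return utf8_len(word) >= min_word
end

local word = "l\195\169ris"  -- "léris", with "é" written as bytes C3 A9
print(#word)                 -- 6 (bytes, what string.len sees)
print(utf8_len(word))        -- 5 (characters)
print(long_enough(word, 6))  -- false
```

With a check along these lines, minWord would be counted in the same unit as leftmin/rightmin.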
We override a pattern below, but it doesn't work with an uppercase input (bypassing the exception).
> SILE.call("hyphenator:add-exceptions", { lang="fr" }, { "légè-rement" }) -- Override as exception
> SILE.showHyphenationPoints("légèrement", "fr")
légè-rement
> SILE.showHyphenationPoints("LÉGÈREMENT", "fr")
LÉGÈ-RE-MENT
> -- OOPS, expected "LÉGÈ-REMENT"
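The casing problem can be sketched the same way. string.lower works byte by byte and depends on the C locale, so it typically leaves the bytes of "É" and "È" untouched, and the uppercase word never matches the lowercase exception. The snippet below is only an illustration: lower_sketch and its two-entry mapping table are hypothetical, and a real fix would presumably rely on a proper Unicode-aware lowercasing routine rather than a hand-written table.

```lua
-- Tiny illustrative mapping, covering only the two letters of the example.
local upper_to_lower = {
  ["\195\137"] = "\195\169", -- É -> é
  ["\195\136"] = "\195\168", -- È -> è
}

-- Hypothetical helper: lowercase ASCII letters and the mapped sequences.
local function lower_sketch(s)
  -- ASCII letters: shift A-Z to a-z without relying on the C locale.
  s = s:gsub("[A-Z]", function(c)
    return string.char(c:byte() + 32)
  end)
  -- Multi-byte sequences: look each one up; unknown ones are kept as-is
  -- (gsub keeps the original match when the function returns nil).
  s = s:gsub("[\194-\244][\128-\191]*", function(ch)
    return upper_to_lower[ch]
  end)
  return s
end

local upper = "L\195\137G\195\136REMENT"      -- "LÉGÈREMENT"
local lower = "l\195\169g\195\168rement"      -- "légèrement"
print(lower_sketch(upper) == lower)           -- true
```

Once the input is lowercased consistently, the exception lookup for "légè-rement" has a chance to match the uppercase form as well.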