sile-typesetter / sile

The SILE Typesetter — Simon’s Improved Layout Engine

Home Page: https://sile-typesetter.org

Hyphenation minimal word length and casing are not UTF8-compliant

Omikhleia opened this issue

Issue

Related to #2017 (the hard-coded minWord = 5 value), but this is a different kind of issue.

The logic is not UTF-8 compliant:

if string.len(text) < self.minWord then return { text } end
local points = self.exceptions[text:lower()]
local word = SU.splitUtf8(text)
if not points then
  points = SU.map(function () return 0 end, word)
  local work = SU.map(string.lower, word)

  • string.len is not UTF-8-safe: it counts bytes, not characters, so the minWord value is likely not honored as intended.
    • minWord seems to correspond to LuaTeX's hyphenationmin.
    • One could argue that it is meant to be in bytes here, but then it is inconsistent with leftmin/rightmin, which are counted in characters.
  • text:lower() is not UTF-8-safe either.
  • The same goes for the string.lower call a bit later, inside SU.map. (Aren't we also lowercasing a second time there? That part is somewhat acceptable.)
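To illustrate the byte/character mismatch outside of SILE, here is a minimal plain-Lua sketch. The helper `utf8_length` is hypothetical (it is not part of SILE) and assumes well-formed UTF-8 input. Within SILE itself, the length of the table returned by SU.splitUtf8 could presumably serve the same purpose, but that is an assumption, not a tested fix.

```lua
-- Hypothetical helper: count UTF-8 characters by counting every byte
-- that is NOT a continuation byte (0x80-0xBF). Assumes valid UTF-8.
local function utf8_length(text)
  local _, count = text:gsub("[^\128-\191]", "")
  return count
end

local word = "l\195\169ris" -- "léris", with "é" written as its two UTF-8 bytes
local minWord = 6

print(string.len(word))            -- 6 (bytes)
print(utf8_length(word))           -- 5 (characters)

-- Byte-based check: the word is NOT considered too short, so it gets hyphenated.
print(string.len(word) < minWord)  -- false
-- Character-based check: the word IS too short and would be left alone.
print(utf8_length(word) < minWord) -- true
```

With a character-based comparison, a five-letter word such as "léris" would be skipped as soon as minWord is 6, which matches the expectation stated in this issue.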

Proofs / Minimal examples

With minWord set to 6, the second call below would be expected not to hyphenate "léris":

> SILE.showHyphenationPoints("léris", "fr")
lé-ris
> SILE._hyphenators["fr"].minWord
5
> SILE._hyphenators["fr"].minWord = 6
> SILE.showHyphenationPoints("léris", "fr")
lé-ris
> -- OOPS. "léris" is 5 characters long (but 6 bytes long)
> SILE._hyphenators["fr"].minWord = 7
> SILE.showHyphenationPoints("léris", "fr")
léris

Below, we add an exception to override the patterns, but an uppercase input bypasses the exception:

> SILE.call("hyphenator:add-exceptions", { lang="fr" }, { "légè-rement" }) -- Override as exception
> SILE.showHyphenationPoints("légèrement", "fr")
légè-rement
> SILE.showHyphenationPoints("LÉGÈREMENT", "fr")
LÉGÈ-RE-MENT
> -- OOPS, expected "LÉGÈ-REMENT"
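The missed lookup can be reproduced in plain Lua. The sketch below is only an illustration under stated assumptions: the `exceptions` table and its placeholder value are hypothetical stand-ins rather than SILE's actual data structure, and `utf8_lower` uses a deliberately tiny mapping table covering only the two letters needed here. A real fix would more likely rely on a Unicode-aware lowercasing function from a UTF-8 library.

```lua
-- Hypothetical, deliberately tiny case-mapping table (demo only).
local LOWER = {
  ["\195\137"] = "\195\169", -- É -> é
  ["\195\136"] = "\195\168", -- È -> è
}

-- Hypothetical helper: lowercase one UTF-8 character at a time.
-- Each match is a non-continuation byte followed by its continuation bytes.
local function utf8_lower(text)
  return (text:gsub("[^\128-\191][\128-\191]*", function (char)
    if #char == 1 then
      return char:lower()      -- plain ASCII: string.lower is fine
    end
    return LOWER[char] or char -- multi-byte: use the mapping table, if any
  end))
end

-- Stand-in for an exception table keyed by the lowercase word.
local exceptions = { ["l\195\169g\195\168rement"] = "placeholder" }

local input = "L\195\137G\195\136REMENT" -- "LÉGÈREMENT"

-- string.lower only handles single bytes; in Lua's default "C" locale it
-- leaves É and È untouched, so the key does not match.
print(exceptions[input:lower()])

-- UTF-8-aware lowercasing yields "légèrement", so the exception is found.
print(exceptions[utf8_lower(input)]) -- placeholder
```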