w3c / mlreq

Mongolian Layout Requirements

Home Page: https://www.w3.org/International/mlreq/

Todo hyphen

mheijdra opened this issue

Although the main discussion in this group concerns Mongolian written in the traditional Mongolian script, and not Todo, Manchu or Sibe, I would suggest adding a note on how the Todo hyphen is supposed to work.

commented

@mheijdra would you be able to propose some text for this? (If so, contact me at ishida@w3.org so i can take you through the necessary agreements.)

Btw, Badral Sanlig (@badaa) et al. mentioned at April’s Mongolian Working Group meeting that the so-called TODO SOFT HYPHEN is not only meant for Todo text, but is also used (or wanted, if there isn’t an established usage yet) in Mongolian/Hudum text. Badral might be able to provide more background information.

In the Unicode Standard, v11, p536:

In writing Mongolian and Todo, U+1806 MONGOLIAN TODO SOFT HYPHEN is used at the beginning of the second line to indicate resumption of a broken word. It functions like U+2010 HYPHEN, except that U+1806 appears at the beginning of a line rather than at the end.
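To make the difference in placement concrete, here is a minimal Python sketch of the two behaviours the quote describes. It is a toy, not a line-breaking implementation: the helper name `split_at_hyphen` is hypothetical, the Latin letters are placeholders for a Mongolian word, and a real engine would follow the Unicode line-breaking rules rather than searching for a single character.

```python
TODO_SOFT_HYPHEN = "\u1806"  # MONGOLIAN TODO SOFT HYPHEN
HYPHEN = "\u2010"            # HYPHEN


def split_at_hyphen(text: str, hyphen: str, break_before: bool) -> list[str]:
    """Split text into two lines at the first occurrence of hyphen.

    break_before=True puts the hyphen at the start of the second line,
    break_before=False leaves it at the end of the first line.
    """
    index = text.find(hyphen)
    if index == -1:
        return [text]
    cut = index if break_before else index + len(hyphen)
    # Do not produce an empty line when the hyphen sits at either edge.
    if cut == 0 or cut == len(text):
        return [text]
    return [text[:cut], text[cut:]]


# Placeholder word halves "abc" and "def" stand in for Mongolian text.
print(split_at_hyphen("abc" + TODO_SOFT_HYPHEN + "def", TODO_SOFT_HYPHEN, True))
print(split_at_hyphen("abc" + HYPHEN + "def", HYPHEN, False))
```

With U+1806 the second line starts with the hyphen; with U+2010 the first line ends with it.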

commented

Folks, we need to clarify the situation about the Todo soft hyphen character for CSS folks as well as for our gap analysis (which i suspect needs to be rewritten). cc @badaa @mheijdra @fantasai

My understanding is as follows:

  1. Mongolian words are not usually split during line-breaking (which is useful, because it avoids tricky questions about cursive behaviour at line breaks). However, compound words may contain hyphens, and in those cases the line may be broken at the hyphen. The hyphen should not remain at the end of the first line, but should move to the start of the next line.
  2. Unicode provides U+1806 MONGOLIAN TODO SOFT HYPHEN for that. It has a line-break property value of BB (Break Before), so it should end up in the right place. Furthermore, it is not a formatting character (i.e. it is always visible, just like U+2010 HYPHEN, and doesn't appear/disappear in certain contexts). Also, it's not only used in Todo, but also in Mongolian text.
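The "not a formatting character" point in item 2 can be checked against the Unicode Character Database. A small sketch using Python's standard `unicodedata` module follows; the helper name `describe` is hypothetical. The module reports names and general categories but does not, as far as I know, expose the Line_Break property, so the BB value itself still has to be looked up in the Unicode data files.

```python
import unicodedata


def describe(code_point: int) -> tuple[str, str]:
    """Return (character name, general category) for a code point."""
    ch = chr(code_point)
    return unicodedata.name(ch), unicodedata.category(ch)


# U+1806 and U+2010 are dash punctuation (Pd), i.e. ordinary visible characters.
# U+00AD SOFT HYPHEN is a format character (Cf), shown only at a line break.
for cp in (0x1806, 0x2010, 0x00AD):
    name, category = describe(cp)
    print(f"U+{cp:04X} {name}: {category}")
```

Despite the words "soft hyphen" in its name, U+1806 is categorised like U+2010 rather than like U+00AD.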

Here are my questions:

  1. is my summary above correct?
  2. can anyone provide me with an example of such a compound word? I had a search but didn't find 1806 being used online, but perhaps it should be used rather than an ordinary hyphen for ᠠᠲ᠋ᠠ‍᠆ᠮᠠᠯᠢᠺ and ᠠᠳ᠋᠆ᠳᠢᠨ‍ ? source
  3. is this hyphen only used for compound words, or should it be used generally for bits of text that are separated by hyphens? For example, would it be used for 2021-2030, or MK-DOS, or ᠪᠠᠷᠢᠮᠵᠢᠶ᠎ᠠ-8045, etc?

thanks for your help.

commented

I tried it out on the major browser engines. For a test and results see w3c/line_paragraph_tests#84

Summary: Works on Blink & WebKit (though the latter doesn't produce readable text), but fails on Gecko.

commented

perhaps it should be used rather than an ordinary hyphen for ᠠᠲ᠋ᠠ‍᠆ᠮᠠᠯᠢᠺ and ᠠᠳ᠋᠆ᠳᠢᠨ‍ ?

@mheijdra do you not think these compounds are candidates for examples?

Hi folks,

I'm not a native speaker of Mongolian, and I haven't studied the language in over a decade, but I still have all my reference material. According to Mongolian Grammar (D. Tserenpil, R. Kullmann), hyphen usage in the script is described as follows on p.402:

Marks local or temporal beginnings and ends.

On p.405, usage is extended for Cyrillic:

Hyphens are used to connect two words, numbers and their suffixes, and also to connect personal names: If the second of the two names begins with a vowel, then a hyphen is put in between them. In literature, works of similar meaning are equated using a hyphen

I found an example of the word "July" on p.85. The Cyrillic uses the hyphen, the script uses a space.

I also happened to find an English compound word on p.198 "pitch-black" which is not a hyphen compound in either Cyrillic or the script; just broken into two separate words.

My assumption, because the language is classified as agglutinative, is that there are probably no hyphenated compound words native to the script. I would further assume that hyphen usage in the script is as stated above, i.e. only for date and time ranges.

As a final rough check, I paged through my Mongolian English Dictionary (F. Lessing) looking for words with Cyrillic hyphens and didn't come across any. If you have something you'd like me to look up in particular to view the script, I have the physical copy.

Actually, the TODO SOFT HYPHEN would be used a lot if it worked as expected. U+1806 is probably meant to connect Mongolian suffixes to abbreviations or foreign words. For example, "Director General of the UN" is written in Mongolian as ᠨᠡ‍᠂ᠦ‍᠂ᠪᠠ‍᠂᠆ᠤᠨ ᠶᠡᠷᠦᠩᢈᠡᠢ ᠵᠠᢈᠢᠷᠤᠯ, and the abbreviation can be declined for every grammatical case and used in other constructions. There are numerous suffixes, as Mongolian is morphologically rich.
Now what is the problem?
The problem is that a suffix must start without a Titem, which means it must start like a medial variant. So people use a normal hyphen + nirugu in this case to get their text to display correctly.
That is what I can say off the top of my head. I will ask Jamiyan about other use cases.
Remark:
This should not be confused with the hyphen that is commonly used to write (compound) personal names in Mongolia. A personal name is written with a hyphen if the second word starts with a vowel. This is already standardized for the official Cyrillic script, so people always write their names with a hyphen in Mongolian script as well. For example, Алтан-Уул, Altan-Uul (meaning Goldberg).
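The "normal hyphen + nirugu" workaround mentioned in this comment is easier to compare with the U+1806 encoding when both are written out as code points. The sketch below rests on several assumptions: "normal hyphen" is taken to be U+002D HYPHEN-MINUS (it could equally be U+2010 HYPHEN), the nirugu is U+180A MONGOLIAN NIRUGU, the suffix is taken to be U+1824 U+1828, and the stem "UN" is a Latin placeholder rather than the Mongolian abbreviation. The helper `code_points` is hypothetical, and the code says nothing about how either sequence actually renders.

```python
import unicodedata

STEM = "UN"               # placeholder for an abbreviation or foreign word
SUFFIX = "\u1824\u1828"   # assumed suffix: MONGOLIAN LETTER U + MONGOLIAN LETTER NA

# Encoding with U+1806 between stem and suffix.
intended = STEM + "\u1806" + SUFFIX
# Workaround described in the comment: ordinary hyphen followed by a nirugu.
workaround = STEM + "-" + "\u180A" + SUFFIX


def code_points(text: str) -> list[str]:
    """List each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in (("intended", intended), ("workaround", workaround)):
    print(label)
    for entry in code_points(text):
        print("  " + entry)
```

The workaround needs two characters where the U+1806 encoding needs one.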

I pointed a few language experts to this conversation and Tim Brookes of Endangered Alphabets says he may know someone who could answer the questions here. Contact him at tim@endangeredalphabets.com and he’ll help out.

commented

@ZmongolCode Thank you for your comments. While i share your frustration at the time it is taking to arrive at the best encoding model for Mongolian, and i sympathize with the need to keep Mongolian simple, it's not something we can fix here at the W3C. I recommend you contact the Unicode Consortium, who are the people working on that.

For clarity (for everyone participating in this discussion), this thread is not about MVS or FVS. It's about U+1806 TODO SOFT HYPHEN (or similar visible hyphens/dashes) and their relationship with line-breaking. If you'd like to discuss something else, feel free to start another issue.

Housekeeping note for those who are replying via email, rather than using the GitHub interface: please delete other people's emails from the bottom of your message before sending. That reduces clutter and makes it easier to follow the GH thread. Thanks.