Relax bare key restrictions to allow additional unicode letters and numbers

marzer opened this issue 5 years ago

Issue

TOML's "bare key" syntax is too restrictive. People who regularly use characters from languages other than English should be able to do so in TOML keys without additional gymnastics.

I know there's already been a lot of discussion about this but much of it was from when TOML was less established and I think it warrants revisiting.

Proposed change

Expand the set of characters accepted in bare keys to include letters and numbers from the entire Unicode space, similar to how identifiers are handled in other Unicode-compliant contexts (e.g. Python, JavaScript, etc.). Specifically:

Allow codepoints from categories Ll, Lm, Lo, Lt, Lu, Nd and Nl anywhere in a bare key
Allow codepoints from categories Mc and Mn anywhere in a bare key except as the first character
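
A minimal sketch of the two rules above, assuming Python's standard `unicodedata` module as the source of general categories. The names `ANYWHERE`, `NOT_FIRST` and `is_bare_key` are hypothetical, chosen for illustration only, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# General categories named in the proposal above.
ANYWHERE = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nd", "Nl"}
NOT_FIRST = {"Mc", "Mn"}


def is_bare_key(key: str) -> bool:
    """Return True if `key` would be a valid bare key under the proposed rule."""
    if not key:
        return False
    for index, char in enumerate(key):
        if char in "_-":
            continue  # underscore and dash are already permitted today
        category = unicodedata.category(char)
        if category in ANYWHERE:
            continue  # letters and numbers, including ASCII A-Z, a-z, 0-9
        if category in NOT_FIRST and index > 0:
            continue  # combining marks, but never as the first code point
        return False
    return True
```

Under this sketch `is_bare_key("größe")` and `is_bare_key("名前")` hold, while a key containing a space, a dot, or a leading combining mark is rejected and would still need quotes.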

Rationale

After reading much of the existing discussion on the issue, I've identified the points below as being the main objections. I've written a counterpoint for each.

"ASCII-only is easy to understand"

Allowing Unicode letters and numbers wouldn't change the understandability of the written word in "mostly-ASCII" contexts, excepting maybe people from English-centric countries encountering characters they otherwise rarely see and being unsure how to pronounce them. I'm one of those people and my brain seems to consume them just fine. And it's almost certainly going to improve the understandability of bare keys to people for whom an ASCII environment is not their regular one.

It also wouldn't change the semantic/syntactic understandability of the language; I'm only advocating relaxing the spec to allow letter and number characters, not anything that might be confused for a language construct (no math symbols, for instance).

"Guides users to choose simple key names"

See above. I'd argue that the keys would be no less simple with this change. I live and work in a European country and a number of my friends and colleagues have non-ASCII letters in their name (e.g. ä). I doubt they consider their names to be complex; I certainly don't. If anything, by forcing people to jump through hoops just to type in their language, we're actually making the key names more complex w.r.t. cognitive load.

"Eliminate any weirdness that could come from having to deal with undelimited Unicode"

The TOML spec dictates UTF-8, not UTF-8-ish. UTF-8 is a solved problem at this point. If a parser doesn't correctly detect and handle malformed UTF-8, I'd argue that the parser needs fixing, not that we should bend over to accommodate users who are using crap tools and libraries. It's such a solved problem that you can even portably consume it using a state machine and validate it using vector intrinsics.
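
As a rough illustration of how little is needed to reject malformed input, here is a sketch that leans on Python's built-in strict UTF-8 decoder. The helper name `is_valid_utf8` is hypothetical; a real parser would more likely validate while decoding rather than in a separate pass.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

Truncated sequences, stray continuation bytes and overlong encodings are all rejected by the strict decoder, so a parser built on it never sees "UTF-8-ish" input.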

"Keys should be identifier-like"

Despite the fact that the concept of an "identifier" isn't a thing in TOML, I'll concede that in some situations this might be a concern. A reasonable example is using TOML in code generation contexts: if you used TOML keys to inform variable names, you would historically have run into issues with non-ASCII characters in many languages, though this is no longer true. Even good old C++ supports unicode characters in identifiers on modern compilers.

...all of which is rendered moot by the fact that TOML supports hyphens in bare keys which are often invalid in identifier contexts, so this objection is a non-starter anyway.
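
To make the point concrete, here is a small check using Python as one example of an "identifier context" (other languages have their own rules). It relies only on the built-in `str.isidentifier()`; the helper name `usable_as_python_name` is made up for this sketch.

```python
def usable_as_python_name(key: str) -> bool:
    """Return True if a TOML key could be used verbatim as a Python identifier."""
    return key.isidentifier()


# A key that is a valid bare key in TOML today, but not an identifier.
hyphenated = "build-dir"

# A key that needs quotes in TOML today, yet is a fine Python identifier.
non_ascii = "größe"
```

Here `usable_as_python_name(hyphenated)` is false and `usable_as_python_name(non_ascii)` is true, which is the mismatch the paragraph above describes.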

"It complicates implementation"

It really doesn't. Many implementations will be able to leverage built-in helper functions or libraries for working with Unicode. For those that can't, I've put my money where my mouth is and implemented this as a proof-of-concept in my own TOML parser and I'm happy for my code to be used as a starting point:

Unicode utils, with is_unicode_XXXXX() codepoint identity functions (generated by a script)
Parser

Of course you might argue that simply accepting UTF-8 bytes from a TOML implementation is not an option for everyone, and you'd be right; there will always be situations where only ASCII makes sense (e.g. legacy codebases). I'd respond by pointing out that detecting non-ASCII characters in a character stream is laughably trivial. Applications requiring ASCII-only can easily enforce this themselves.
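
A sketch of how an application could enforce ASCII-only input itself, assuming Python 3.7 or later for `str.isascii()`. The function name `require_ascii` and the error wording are illustrative, not part of any TOML library.

```python
def require_ascii(document: str) -> None:
    """Raise ValueError if the document contains any non-ASCII character."""
    for line_number, line in enumerate(document.splitlines(), start=1):
        if not line.isascii():
            raise ValueError(f"non-ASCII character on line {line_number}")
```

An application with an ASCII-only requirement could run a check like this before handing the text to its TOML parser, without the spec having to impose the restriction on everyone.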

Christian Siefkes · Answer 1 · Wed Dec 11 2019 22:32:53 GMT+0800 (China Standard Time)

I wholeheartedly support this. Identifiers in non-English languages should not be discriminated against in TOML (and having to put them in quotes is a form of discrimination). [Note: the following refers to the original form of the proposal, which since then has been considerably extended.] Admittedly, with this proposal this would still only be the case for languages using the Latin script, but not for Russian, Arabic, Chinese etc. But it would still be a step in the right direction – and I see that there might be issues with allowing, say, Cyrillic letters, because they might be used to spoof a key that looks like ASCII but actually isn't. With Latin diacritics, this risk is much lower.

For completeness, I'd suggest also supporting Latin Extended-B (Pan-Nigerian alphabet, Pinyin, Romanian), Latin Extended Additional (Vietnamese), and Latin Extended-C (Shona, a Bantu language).
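
The spoofing concern raised above can be illustrated with a short sketch. It uses Python's `unicodedata.name()`; the variable names and the `describe` helper are hypothetical.

```python
import unicodedata

latin_key = "a"          # U+0061
cyrillic_key = "\u0430"  # U+0430, drawn like a Latin "a" in most fonts


def describe(text: str) -> list:
    """List each code point in `text` with its Unicode character name."""
    return [f"U+{ord(char):04X} {unicodedata.name(char)}" for char in text]
```

The two keys compare unequal even though they usually look identical on screen, and `describe()` shows why: one is LATIN SMALL LETTER A, the other CYRILLIC SMALL LETTER A.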

Alexey Sergin · Answer 2 · Wed Dec 11 2019 23:33:55 GMT+0800 (China Standard Time)

So there would be no simple rule for a human writer to decide if quotes are necessary.

Alexey Sergin · Answer 3 · Wed Dec 11 2019 23:37:01 GMT+0800 (China Standard Time)

> Even good old C++ supports unicode characters in identifiers on modern compilers.

Do "modern C++ compilers" support most of Unicode, or only a chosen subset of Latin Extended?

Mark Gillard · Answer 4 · Thu Dec 12 2019 05:49:53 GMT+0800 (China Standard Time)

@ChristianSi Certainly there's lots of additional characters we could add. My list of suggestions was in no way meant to be exhaustive and I'm hoping that a more useful set of ranges is borne out of discussion. Being an Australian who only speaks English does limit my perspective a bit here!

@lmna

> So there would be no simple rule for a human writer to decide if quotes are necessary.

Technically-speaking the rule could be: quotes if you need whitespace, an escape code, or a TOML-reserved character, otherwise anything goes. My feeling is that requiring users to think about this at all is getting away from the design goals of TOML and drifting too far into Think-Like-A-Programmer territory, which risks defeating the purpose of a simple config file format that intends to "just work" the way people would expect in the layman case.

> Do "modern C++ compilers" support most of Unicode, or only a chosen subset of Latin Extended?

Absolutely no idea. Here's a live demo of some Unicode on Clang, GCC and MSVC; I encourage you to experiment. I'm certain if we needed a more definitive answer we could read the compiler source code (Clang or GCC) or ask the relevant developers (MSVC).

Pradyun Gedam · Answer 5 · Thu Dec 12 2019 21:13:49 GMT+0800 (China Standard Time)

I'm on board.

I like the way Python handles alphabetical characters:

Alphabetic characters are those characters defined in the Unicode character database as “Letter”

That should be do-able for us, but I wonder how much it could complicate TOML implementations which are currently just doing a simple ASCII-value-range check. This also means we'd have to add a lot to TOML's ABNF.

Christian Siefkes · Answer 6 · Thu Dec 12 2019 22:28:37 GMT+0800 (China Standard Time)

@pradyunsg: I too would be fine with saying "Bare keys may contain arbitrary Unicode letters as well as ASCII digits, underscores, and dashes". But in effect this would likely mean that implementations would have to depend on some kind of Unicode library – easy in Python, where the isalpha() check is built in, probably not so easy in some other languages.

An advantage of @marzer's original proposal, or my somewhat modified one, is that it would be easy to enumerate the affected ranges manually, in the ABNF or in code. With arbitrary Unicode letters this likely becomes effectively impossible – especially since the ranges would have to be extended with each new version of the Unicode standard.

But, of course, we might also decide to allow letters in arbitrary languages and scripts, not just Latin-based ones, and accept the Unicode dependency.
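
For comparison, the "arbitrary Unicode letters plus ASCII digits, underscores, and dashes" wording could be checked in Python roughly as follows. This is a sketch under the assumption that `str.isalpha()` matches the intended notion of "letter" (it covers the categories Lm, Lt, Lu, Ll and Lo); the names `ASCII_EXTRAS` and `is_letter_bare_key` are invented for the example.

```python
import string

# ASCII digits, underscore and dash, as in the wording quoted above.
ASCII_EXTRAS = set(string.digits + "_-")


def is_letter_bare_key(key: str) -> bool:
    """Return True if `key` uses only Unicode letters plus the ASCII extras."""
    return bool(key) and all(
        char.isalpha() or char in ASCII_EXTRAS for char in key
    )
```

Note that under this wording non-ASCII digits, such as Bengali ones, would still be rejected, because `isalpha()` does not cover number categories.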

Christian Siefkes · Answer 7 · Thu Dec 12 2019 22:32:50 GMT+0800 (China Standard Time)

Here's a listing of all the Unicode Character Categories and all the characters that belong to each one. Unsurprisingly, it's quite long!

Mark Gillard · Answer 8 · Thu Dec 12 2019 22:35:32 GMT+0800 (China Standard Time)

@ChristianSi Given that TOML is supposed to be UTF-8 I'm inclined to think that requiring implementations use unicode machinery, hand-rolled or otherwise, isn't really a big deal, regardless of the direction this proposal takes.

@pradyunsg I too like the Python approach, and I don't think it would be too hard to implement. I'm currently writing a TOML library of my own and I'd be happy to build it into my UTF-8 decoder as a proof-of-concept, if that's useful.

Mark Gillard · Answer 9 · Fri Dec 13 2019 04:08:16 GMT+0800 (China Standard Time)

@ChristianSi I wrote a script to scrape letter characters from the website you linked, sort them and list them as ranges. If you omit the letter categories Lm and Lo the set of character ranges seems totally manageable:

removed, see below

@pradyunsg I don't know much about ABNF, but if characters can be expressed as ranges this wouldn't be much work.

Christian Siefkes · Answer 10 · Fri Dec 13 2019 17:08:08 GMT+0800 (China Standard Time)

@marzer Interesting. But the problem is that nearly all Unicode letters are in the "Letter, other" (Lo) category – 97 percent according to Wikipedia. Ignoring them, you only get the letters in alphabets that distinguish between upper case and lower case forms – Latin, Cyrillic, Greek, and a few others. But most writing systems don't – e.g. those used to write Chinese, Arabic, Hebrew, Korean, and certain Indian languages such as Tamil and Telugu know no such distinction. Hence their letters go into the "other" category.

What happens when you consider all letters? I suppose ranges become a bit unwieldy?

Mark Gillard · Answer 11 · Fri Dec 13 2019 17:15:34 GMT+0800 (China Standard Time)

@ChristianSi Surprisingly it's not that unwieldy:

removed, see below

Christian Siefkes · Answer 12 · Fri Dec 13 2019 17:59:00 GMT+0800 (China Standard Time)

@marzer Ooo-kay. But it seems that website is outdated or incomplete. It only lists 16249 Other Letters, while Wikipedia says there should be 121414. This list seems complete.

Moreover, you should also add the Lm (Letter, modifier) category – there are only 259 of them.

Mark Gillard · Answer 13 · Fri Dec 13 2019 18:01:05 GMT+0800 (China Standard Time)

@ChristianSi Ah, good find. I'll update the script later tonight and see how it looks.

Pradyun Gedam · Answer 14 · Fri Dec 13 2019 18:03:30 GMT+0800 (China Standard Time)

Thanks for exploring this @ChristianSi and @marzer! ^>^

Pradyun Gedam · Answer 15 · Fri Dec 13 2019 18:13:08 GMT+0800 (China Standard Time)

Marking this as a post-1.0 change, since I imagine this relaxation would not make any valid documents invalid -- thus, we can augment this in a non-major version bump.

Mark Gillard · Answer 16 · Fri Dec 13 2019 23:12:09 GMT+0800 (China Standard Time)

@ChristianSi Ok, I updated the script to scrape directly from the unicode consortium's character database and amended it to include all of the letter characters, and it looks like this:

Added 125634 codepoints from 5 categories.
Ranges:
        0x41 => 0x5A
        0x61 => 0x7A
        0xAA
        0xB5
        0xBA
        0xC0 => 0xD6
        0xD8 => 0xF6
        0xF8 => 0x2C1
        0x2C6 => 0x2D1
        0x2E0 => 0x2E4
        0x2EC
        0x2EE
        0x370 => 0x374
        0x376 => 0x377
        0x37A => 0x37D
        0x37F
        0x386
        0x388 => 0x38A
        0x38C
        0x38E => 0x3A1
        0x3A3 => 0x3F5
        0x3F7 => 0x481
        0x48A => 0x52F
        0x531 => 0x556
        0x559
        0x560 => 0x588
        0x5D0 => 0x5EA
        0x5EF => 0x5F2
        0x620 => 0x64A
        0x66E => 0x66F
        0x671 => 0x6D3
        0x6D5
        0x6E5 => 0x6E6
        0x6EE => 0x6EF
        0x6FA => 0x6FC
        0x6FF
        0x710
        0x712 => 0x72F
        0x74D => 0x7A5
        0x7B1
        0x7CA => 0x7EA
        0x7F4 => 0x7F5
        0x7FA
        0x800 => 0x815
        0x81A
        0x824
        0x828
        0x840 => 0x858
        0x860 => 0x86A
        0x8A0 => 0x8B4
        0x8B6 => 0x8BD
        0x904 => 0x939
        0x93D
        0x950
        0x958 => 0x961
        0x971 => 0x980
        0x985 => 0x98C
        0x98F => 0x990
        0x993 => 0x9A8
        0x9AA => 0x9B0
        0x9B2
        0x9B6 => 0x9B9
        0x9BD
        0x9CE
        0x9DC => 0x9DD
        0x9DF => 0x9E1
        0x9F0 => 0x9F1
        0x9FC
        0xA05 => 0xA0A
        0xA0F => 0xA10
        0xA13 => 0xA28
        0xA2A => 0xA30
        0xA32 => 0xA33
        0xA35 => 0xA36
        0xA38 => 0xA39
        0xA59 => 0xA5C
        0xA5E
        0xA72 => 0xA74
        0xA85 => 0xA8D
        0xA8F => 0xA91
        0xA93 => 0xAA8
        0xAAA => 0xAB0
        0xAB2 => 0xAB3
        0xAB5 => 0xAB9
        0xABD
        0xAD0
        0xAE0 => 0xAE1
        0xAF9
        0xB05 => 0xB0C
        0xB0F => 0xB10
        0xB13 => 0xB28
        0xB2A => 0xB30
        0xB32 => 0xB33
        0xB35 => 0xB39
        0xB3D
        0xB5C => 0xB5D
        0xB5F => 0xB61
        0xB71
        0xB83
        0xB85 => 0xB8A
        0xB8E => 0xB90
        0xB92 => 0xB95
        0xB99 => 0xB9A
        0xB9C
        0xB9E => 0xB9F
        0xBA3 => 0xBA4
        0xBA8 => 0xBAA
        0xBAE => 0xBB9
        0xBD0
        0xC05 => 0xC0C
        0xC0E => 0xC10
        0xC12 => 0xC28
        0xC2A => 0xC39
        0xC3D
        0xC58 => 0xC5A
        0xC60 => 0xC61
        0xC80
        0xC85 => 0xC8C
        0xC8E => 0xC90
        0xC92 => 0xCA8
        0xCAA => 0xCB3
        0xCB5 => 0xCB9
        0xCBD
        0xCDE
        0xCE0 => 0xCE1
        0xCF1 => 0xCF2
        0xD05 => 0xD0C
        0xD0E => 0xD10
        0xD12 => 0xD3A
        0xD3D
        0xD4E
        0xD54 => 0xD56
        0xD5F => 0xD61
        0xD7A => 0xD7F
        0xD85 => 0xD96
        0xD9A => 0xDB1
        0xDB3 => 0xDBB
        0xDBD
        0xDC0 => 0xDC6
        0xE01 => 0xE30
        0xE32 => 0xE33
        0xE40 => 0xE46
        0xE81 => 0xE82
        0xE84
        0xE86 => 0xE8A
        0xE8C => 0xEA3
        0xEA5
        0xEA7 => 0xEB0
        0xEB2 => 0xEB3
        0xEBD
        0xEC0 => 0xEC4
        0xEC6
        0xEDC => 0xEDF
        0xF00
        0xF40 => 0xF47
        0xF49 => 0xF6C
        0xF88 => 0xF8C
        0x1000 => 0x102A
        0x103F
        0x1050 => 0x1055
        0x105A => 0x105D
        0x1061
        0x1065 => 0x1066
        0x106E => 0x1070
        0x1075 => 0x1081
        0x108E
        0x10A0 => 0x10C5
        0x10C7
        0x10CD
        0x10D0 => 0x10FA
        0x10FC => 0x1248
        0x124A => 0x124D
        0x1250 => 0x1256
        0x1258
        0x125A => 0x125D
        0x1260 => 0x1288
        0x128A => 0x128D
        0x1290 => 0x12B0
        0x12B2 => 0x12B5
        0x12B8 => 0x12BE
        0x12C0
        0x12C2 => 0x12C5
        0x12C8 => 0x12D6
        0x12D8 => 0x1310
        0x1312 => 0x1315
        0x1318 => 0x135A
        0x1380 => 0x138F
        0x13A0 => 0x13F5
        0x13F8 => 0x13FD
        0x1401 => 0x166C
        0x166F => 0x167F
        0x1681 => 0x169A
        0x16A0 => 0x16EA
        0x16F1 => 0x16F8
        0x1700 => 0x170C
        0x170E => 0x1711
        0x1720 => 0x1731
        0x1740 => 0x1751
        0x1760 => 0x176C
        0x176E => 0x1770
        0x1780 => 0x17B3
        0x17D7
        0x17DC
        0x1820 => 0x1878
        0x1880 => 0x1884
        0x1887 => 0x18A8
        0x18AA
        0x18B0 => 0x18F5
        0x1900 => 0x191E
        0x1950 => 0x196D
        0x1970 => 0x1974
        0x1980 => 0x19AB
        0x19B0 => 0x19C9
        0x1A00 => 0x1A16
        0x1A20 => 0x1A54
        0x1AA7
        0x1B05 => 0x1B33
        0x1B45 => 0x1B4B
        0x1B83 => 0x1BA0
        0x1BAE => 0x1BAF
        0x1BBA => 0x1BE5
        0x1C00 => 0x1C23
        0x1C4D => 0x1C4F
        0x1C5A => 0x1C7D
        0x1C80 => 0x1C88
        0x1C90 => 0x1CBA
        0x1CBD => 0x1CBF
        0x1CE9 => 0x1CEC
        0x1CEE => 0x1CF3
        0x1CF5 => 0x1CF6
        0x1CFA
        0x1D00 => 0x1DBF
        0x1E00 => 0x1F15
        0x1F18 => 0x1F1D
        0x1F20 => 0x1F45
        0x1F48 => 0x1F4D
        0x1F50 => 0x1F57
        0x1F59
        0x1F5B
        0x1F5D
        0x1F5F => 0x1F7D
        0x1F80 => 0x1FB4
        0x1FB6 => 0x1FBC
        0x1FBE
        0x1FC2 => 0x1FC4
        0x1FC6 => 0x1FCC
        0x1FD0 => 0x1FD3
        0x1FD6 => 0x1FDB
        0x1FE0 => 0x1FEC
        0x1FF2 => 0x1FF4
        0x1FF6 => 0x1FFC
        0x2071
        0x207F
        0x2090 => 0x209C
        0x2102
        0x2107
        0x210A => 0x2113
        0x2115
        0x2119 => 0x211D
        0x2124
        0x2126
        0x2128
        0x212A => 0x212D
        0x212F => 0x2139
        0x213C => 0x213F
        0x2145 => 0x2149
        0x214E
        0x2183 => 0x2184
        0x2C00 => 0x2C2E
        0x2C30 => 0x2C5E
        0x2C60 => 0x2CE4
        0x2CEB => 0x2CEE
        0x2CF2 => 0x2CF3
        0x2D00 => 0x2D25
        0x2D27
        0x2D2D
        0x2D30 => 0x2D67
        0x2D6F
        0x2D80 => 0x2D96
        0x2DA0 => 0x2DA6
        0x2DA8 => 0x2DAE
        0x2DB0 => 0x2DB6
        0x2DB8 => 0x2DBE
        0x2DC0 => 0x2DC6
        0x2DC8 => 0x2DCE
        0x2DD0 => 0x2DD6
        0x2DD8 => 0x2DDE
        0x2E2F
        0x3005 => 0x3006
        0x3031 => 0x3035
        0x303B => 0x303C
        0x3041 => 0x3096
        0x309D => 0x309F
        0x30A1 => 0x30FA
        0x30FC => 0x30FF
        0x3105 => 0x312F
        0x3131 => 0x318E
        0x31A0 => 0x31BA
        0x31F0 => 0x31FF
        0x3400 => 0x4DB4
        0x4E00 => 0x9FEE
        0xA000 => 0xA48C
        0xA4D0 => 0xA4FD
        0xA500 => 0xA60C
        0xA610 => 0xA61F
        0xA62A => 0xA62B
        0xA640 => 0xA66E
        0xA67F => 0xA69D
        0xA6A0 => 0xA6E5
        0xA717 => 0xA71F
        0xA722 => 0xA788
        0xA78B => 0xA7BF
        0xA7C2 => 0xA7C6
        0xA7F7 => 0xA801
        0xA803 => 0xA805
        0xA807 => 0xA80A
        0xA80C => 0xA822
        0xA840 => 0xA873
        0xA882 => 0xA8B3
        0xA8F2 => 0xA8F7
        0xA8FB
        0xA8FD => 0xA8FE
        0xA90A => 0xA925
        0xA930 => 0xA946
        0xA960 => 0xA97C
        0xA984 => 0xA9B2
        0xA9CF
        0xA9E0 => 0xA9E4
        0xA9E6 => 0xA9EF
        0xA9FA => 0xA9FE
        0xAA00 => 0xAA28
        0xAA40 => 0xAA42
        0xAA44 => 0xAA4B
        0xAA60 => 0xAA76
        0xAA7A
        0xAA7E => 0xAAAF
        0xAAB1
        0xAAB5 => 0xAAB6
        0xAAB9 => 0xAABD
        0xAAC0
        0xAAC2
        0xAADB => 0xAADD
        0xAAE0 => 0xAAEA
        0xAAF2 => 0xAAF4
        0xAB01 => 0xAB06
        0xAB09 => 0xAB0E
        0xAB11 => 0xAB16
        0xAB20 => 0xAB26
        0xAB28 => 0xAB2E
        0xAB30 => 0xAB5A
        0xAB5C => 0xAB67
        0xAB70 => 0xABE2
        0xAC00 => 0xD7A2
        0xD7B0 => 0xD7C6
        0xD7CB => 0xD7FB
        0xF900 => 0xFA6D
        0xFA70 => 0xFAD9
        0xFB00 => 0xFB06
        0xFB13 => 0xFB17
        0xFB1D
        0xFB1F => 0xFB28
        0xFB2A => 0xFB36
        0xFB38 => 0xFB3C
        0xFB3E
        0xFB40 => 0xFB41
        0xFB43 => 0xFB44
        0xFB46 => 0xFBB1
        0xFBD3 => 0xFD3D
        0xFD50 => 0xFD8F
        0xFD92 => 0xFDC7
        0xFDF0 => 0xFDFB
        0xFE70 => 0xFE74
        0xFE76 => 0xFEFC
        0xFF21 => 0xFF3A
        0xFF41 => 0xFF5A
        0xFF66 => 0xFFBE
        0xFFC2 => 0xFFC7
        0xFFCA => 0xFFCF
        0xFFD2 => 0xFFD7
        0xFFDA => 0xFFDC
        0x10000 => 0x1000B
        0x1000D => 0x10026
        0x10028 => 0x1003A
        0x1003C => 0x1003D
        0x1003F => 0x1004D
        0x10050 => 0x1005D
        0x10080 => 0x100FA
        0x10280 => 0x1029C
        0x102A0 => 0x102D0
        0x10300 => 0x1031F
        0x1032D => 0x10340
        0x10342 => 0x10349
        0x10350 => 0x10375
        0x10380 => 0x1039D
        0x103A0 => 0x103C3
        0x103C8 => 0x103CF
        0x10400 => 0x1049D
        0x104B0 => 0x104D3
        0x104D8 => 0x104FB
        0x10500 => 0x10527
        0x10530 => 0x10563
        0x10600 => 0x10736
        0x10740 => 0x10755
        0x10760 => 0x10767
        0x10800 => 0x10805
        0x10808
        0x1080A => 0x10835
        0x10837 => 0x10838
        0x1083C
        0x1083F => 0x10855
        0x10860 => 0x10876
        0x10880 => 0x1089E
        0x108E0 => 0x108F2
        0x108F4 => 0x108F5
        0x10900 => 0x10915
        0x10920 => 0x10939
        0x10980 => 0x109B7
        0x109BE => 0x109BF
        0x10A00
        0x10A10 => 0x10A13
        0x10A15 => 0x10A17
        0x10A19 => 0x10A35
        0x10A60 => 0x10A7C
        0x10A80 => 0x10A9C
        0x10AC0 => 0x10AC7
        0x10AC9 => 0x10AE4
        0x10B00 => 0x10B35
        0x10B40 => 0x10B55
        0x10B60 => 0x10B72
        0x10B80 => 0x10B91
        0x10C00 => 0x10C48
        0x10C80 => 0x10CB2
        0x10CC0 => 0x10CF2
        0x10D00 => 0x10D23
        0x10F00 => 0x10F1C
        0x10F27
        0x10F30 => 0x10F45
        0x10FE0 => 0x10FF6
        0x11003 => 0x11037
        0x11083 => 0x110AF
        0x110D0 => 0x110E8
        0x11103 => 0x11126
        0x11144
        0x11150 => 0x11172
        0x11176
        0x11183 => 0x111B2
        0x111C1 => 0x111C4
        0x111DA
        0x111DC
        0x11200 => 0x11211
        0x11213 => 0x1122B
        0x11280 => 0x11286
        0x11288
        0x1128A => 0x1128D
        0x1128F => 0x1129D
        0x1129F => 0x112A8
        0x112B0 => 0x112DE
        0x11305 => 0x1130C
        0x1130F => 0x11310
        0x11313 => 0x11328
        0x1132A => 0x11330
        0x11332 => 0x11333
        0x11335 => 0x11339
        0x1133D
        0x11350
        0x1135D => 0x11361
        0x11400 => 0x11434
        0x11447 => 0x1144A
        0x1145F
        0x11480 => 0x114AF
        0x114C4 => 0x114C5
        0x114C7
        0x11580 => 0x115AE
        0x115D8 => 0x115DB
        0x11600 => 0x1162F
        0x11644
        0x11680 => 0x116AA
        0x116B8
        0x11700 => 0x1171A
        0x11800 => 0x1182B
        0x118A0 => 0x118DF
        0x118FF
        0x119A0 => 0x119A7
        0x119AA => 0x119D0
        0x119E1
        0x119E3
        0x11A00
        0x11A0B => 0x11A32
        0x11A3A
        0x11A50
        0x11A5C => 0x11A89
        0x11A9D
        0x11AC0 => 0x11AF8
        0x11C00 => 0x11C08
        0x11C0A => 0x11C2E
        0x11C40
        0x11C72 => 0x11C8F
        0x11D00 => 0x11D06
        0x11D08 => 0x11D09
        0x11D0B => 0x11D30
        0x11D46
        0x11D60 => 0x11D65
        0x11D67 => 0x11D68
        0x11D6A => 0x11D89
        0x11D98
        0x11EE0 => 0x11EF2
        0x12000 => 0x12399
        0x12480 => 0x12543
        0x13000 => 0x1342E
        0x14400 => 0x14646
        0x16800 => 0x16A38
        0x16A40 => 0x16A5E
        0x16AD0 => 0x16AED
        0x16B00 => 0x16B2F
        0x16B40 => 0x16B43
        0x16B63 => 0x16B77
        0x16B7D => 0x16B8F
        0x16E40 => 0x16E7F
        0x16F00 => 0x16F4A
        0x16F50
        0x16F93 => 0x16F9F
        0x16FE0 => 0x16FE1
        0x16FE3
        0x17000 => 0x187F6
        0x18800 => 0x18AF2
        0x1B000 => 0x1B11E
        0x1B150 => 0x1B152
        0x1B164 => 0x1B167
        0x1B170 => 0x1B2FB
        0x1BC00 => 0x1BC6A
        0x1BC70 => 0x1BC7C
        0x1BC80 => 0x1BC88
        0x1BC90 => 0x1BC99
        0x1D400 => 0x1D454
        0x1D456 => 0x1D49C
        0x1D49E => 0x1D49F
        0x1D4A2
        0x1D4A5 => 0x1D4A6
        0x1D4A9 => 0x1D4AC
        0x1D4AE => 0x1D4B9
        0x1D4BB
        0x1D4BD => 0x1D4C3
        0x1D4C5 => 0x1D505
        0x1D507 => 0x1D50A
        0x1D50D => 0x1D514
        0x1D516 => 0x1D51C
        0x1D51E => 0x1D539
        0x1D53B => 0x1D53E
        0x1D540 => 0x1D544
        0x1D546
        0x1D54A => 0x1D550
        0x1D552 => 0x1D6A5
        0x1D6A8 => 0x1D6C0
        0x1D6C2 => 0x1D6DA
        0x1D6DC => 0x1D6FA
        0x1D6FC => 0x1D714
        0x1D716 => 0x1D734
        0x1D736 => 0x1D74E
        0x1D750 => 0x1D76E
        0x1D770 => 0x1D788
        0x1D78A => 0x1D7A8
        0x1D7AA => 0x1D7C2
        0x1D7C4 => 0x1D7CB
        0x1E100 => 0x1E12C
        0x1E137 => 0x1E13D
        0x1E14E
        0x1E2C0 => 0x1E2EB
        0x1E800 => 0x1E8C4
        0x1E900 => 0x1E943
        0x1E94B
        0x1EE00 => 0x1EE03
        0x1EE05 => 0x1EE1F
        0x1EE21 => 0x1EE22
        0x1EE24
        0x1EE27
        0x1EE29 => 0x1EE32
        0x1EE34 => 0x1EE37
        0x1EE39
        0x1EE3B
        0x1EE42
        0x1EE47
        0x1EE49
        0x1EE4B
        0x1EE4D => 0x1EE4F
        0x1EE51 => 0x1EE52
        0x1EE54
        0x1EE57
        0x1EE59
        0x1EE5B
        0x1EE5D
        0x1EE5F
        0x1EE61 => 0x1EE62
        0x1EE64
        0x1EE67 => 0x1EE6A
        0x1EE6C => 0x1EE72
        0x1EE74 => 0x1EE77
        0x1EE79 => 0x1EE7C
        0x1EE7E
        0x1EE80 => 0x1EE89
        0x1EE8B => 0x1EE9B
        0x1EEA1 => 0x1EEA3
        0x1EEA5 => 0x1EEA9
        0x1EEAB => 0x1EEBB
        0x20000 => 0x2A6D5
        0x2A700 => 0x2B733
        0x2B740 => 0x2B81C
        0x2B820 => 0x2CEA0
        0x2CEB0 => 0x2EBDF
        0x2F800 => 0x2FA1D

Not really any worse than before, even considering it's 125634 characters.
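
The script that produced the list above is not shown in the thread. A comparable listing can be derived with a sketch like the following, which is not that script: it reads categories from Python's bundled `unicodedata` tables instead of scraping the Unicode Character Database, so the exact ranges and totals depend on the interpreter's Unicode version. The names `collapse` and `letter_ranges` are hypothetical.

```python
import sys
import unicodedata

LETTER_CATEGORIES = {"Ll", "Lm", "Lo", "Lt", "Lu"}


def collapse(codepoints):
    """Collapse ascending code points into inclusive (first, last) ranges."""
    ranges = []
    for codepoint in codepoints:
        if ranges and codepoint == ranges[-1][1] + 1:
            ranges[-1][1] = codepoint  # extend the current run
        else:
            ranges.append([codepoint, codepoint])  # start a new run
    return [tuple(pair) for pair in ranges]


def letter_ranges():
    """Return the ranges of all code points in the five letter categories."""
    letters = (
        codepoint
        for codepoint in range(sys.maxunicode + 1)
        if unicodedata.category(chr(codepoint)) in LETTER_CATEGORIES
    )
    return collapse(letters)


if __name__ == "__main__":
    for first, last in letter_ranges():
        if first == last:
            print(f"0x{first:X}")
        else:
            print(f"0x{first:X} => 0x{last:X}")
```

Collapsing to ranges is what keeps the listing manageable: well over a hundred thousand code points reduce to a few hundred entries.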

Dave Ostroske · Answer 17 · Sat Dec 14 2019 01:34:31 GMT+0800 (China Standard Time)

@marzer With what you provided, a PR could be prepared fairly quickly. Could you write that list similarly to how ucschar is written in RFC 3987? You don't need to wrap it; we can do that. But instead of e.g. 0x2F800 => 0x2FA1D, can you write %x2F800-2FA1D instead?

For reference, the part of RFC 3987 I'm referring to looks like this:

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD
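
Converting `0x2F800 => 0x2FA1D` style ranges into the ABNF form requested above is mechanical. A minimal sketch, assuming the ranges are available as `(first, last)` integer pairs; the function name `to_abnf` and the four-terms-per-line wrapping are illustrative choices.

```python
def to_abnf(ranges, per_line=4):
    """Format inclusive (first, last) code point ranges as ABNF alternatives."""
    terms = [
        f"%x{first:X}" if first == last else f"%x{first:X}-{last:X}"
        for first, last in ranges
    ]
    lines = [
        " / ".join(terms[start:start + per_line])
        for start in range(0, len(terms), per_line)
    ]
    # A trailing "/" marks the continuation onto the next, indented line.
    return " /\n        ".join(lines)
```

For instance, `to_abnf([(0x2F800, 0x2FA1D)])` yields `%x2F800-2FA1D`, and single code points are written without a dash.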

Mark Gillard · Answer 18 · Sat Dec 14 2019 01:44:41 GMT+0800 (China Standard Time)

@eksortso here you go:

removed, see below

Christian Siefkes · Answer 19 · Sat Dec 14 2019 20:09:28 GMT+0800 (China Standard Time)

Thanks a lot for your efforts, @marzer ! That looks good so far and indeed manageable, but there are a few complications we have missed so far. I looked at what Python 3 and JavaScript allow in identifiers.

In addition to the five Letter categories we have already, they allow "Letter Number" (Nl) anywhere in an identifier and "Decimal Number" (Nd) anywhere, except at the start. TOML already allows all-numeric keys (they never occur in a position where they can be confused with actual numbers) so I'd consider it reasonable to allow both these categories anywhere in a key – people using Bengali letters in keys might, for example, reasonably expect to be able to use Bengali digits as well. Together they comprise less than 900 characters, so adding them should be quite manageable.

The final Number category (Other Number – No) is not allowed in identifiers in either language.
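
The three Number categories discussed above can be told apart with `unicodedata.category()`. A small sketch; the helper name `is_allowed_number` is hypothetical and simply encodes the suggestion to accept Nd and Nl but not No.

```python
import unicodedata

ALLOWED_NUMBER_CATEGORIES = {"Nd", "Nl"}


def is_allowed_number(char: str) -> bool:
    """Return True for Decimal Number (Nd) and Letter Number (Nl) characters."""
    return unicodedata.category(char) in ALLOWED_NUMBER_CATEGORIES


bengali_one = "\u09e7"      # BENGALI DIGIT ONE, category Nd
roman_eight = "\u2167"      # ROMAN NUMERAL EIGHT, category Nl
superscript_two = "\u00b2"  # SUPERSCRIPT TWO, category No
```

With these samples, the Bengali digit and the Roman numeral are accepted, while the superscript two falls into "Other Number" and is not.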

Moreover, both languages allow anywhere, except at the start, "Nonspacing Mark" (Mn) and "Spacing Mark" (Mc). Now it's important to understand that in Unicode, Marks (Mx categories) are always combining characters – they become logically attached to the preceding character and modify it. For example, Mn contains the "combining grave accent" which goes over the preceding letter and modifies it; Mc contains various Bengali vowel signs which likewise modify the preceding (supposedly Bengali) letter.

Hence it seems indeed important that we support these two categories too, since they are necessary to write certain words in certain languages – without them, support for multilingual bare keys would be incomplete and people might get odd error messages. It's also important that we must NOT allow them at the start of a bare key, since otherwise they would try to modify the preceding non-key character (likely a newline, space, or [ or . in table names or dotted keys) which would be nonsensical and blur the boundary at the start of a key. Together these categories have about 2250 entries, which likewise is manageable.

Finally, both JS and Python allow, except at the start, Connector Punctuation (Pc). That's a very short category with just 10 entries, including the underscore, which we allow already. I don't have strong feelings regarding this category, but would rather tend NOT to allow it in bare keys – we already have underscores and dashes as connectors, and, for example, the Centreline Low Line (﹎) with a tiny dot in the middle could theoretically be confused with the dots that actually separate key elements in hierarchical table names.

So, to summarize, I'd propose to additionally allow Nl and Nd anywhere in a bare key, and Mn and Mc anywhere except as first character (or code point, to be more exact).

In the README, we could then say:

Bare keys may contain arbitrary Unicode letters and digits as well as ASCII underscores (_) and dashes (-). (Technically, code points belonging to the Unicode categories Ll, Lm, Lo, Lt, Lu, Nd and Nl are allowed anywhere in a bare key, and those belonging to the categories Mc and Mn are allowed anywhere except as first code point.)

Mark Gillard · Answer 20 · Sat Dec 14 2019 23:11:55 GMT+0800 (China Standard Time)

@ChristianSi LGTM. This proposal has significantly broadened in scope from my original thought bubble, but definitely for the better.

Can the Mx codepoints appear consecutively? If not, we'd also need to clarify that codepoints from Mx categories cannot appear at the beginning of a key or immediately following another Mx codepoint.

Christian Siefkes · Answer 21 · Sun Dec 15 2019 00:35:08 GMT+0800 (China Standard Time)

@marzer:

> This proposal has significantly broadened in scope from my original thought bubble, but definitely for the better.

Indeed!

Yes, consecutive Mx codepoints are allowed – they all modify the preceding letter, e.g. by placing an acute accent above and an ogonek below it. That's necessary for some languages, such as Navajo.
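
To illustrate the stacked marks described above, here is a sketch that builds an "a" with an ogonek and an acute accent from three code points and inspects it with `unicodedata`. The variable names are illustrative.

```python
import unicodedata

# "a" followed by COMBINING OGONEK and COMBINING ACUTE ACCENT.
stacked = "a\u0328\u0301"

# One base letter followed by two consecutive nonspacing marks.
categories = [unicodedata.category(char) for char in stacked]

# NFC composes the letter with the ogonek, but the acute accent has no
# precomposed form here, so one combining mark remains after normalization.
composed = unicodedata.normalize("NFC", stacked)
composed_categories = [unicodedata.category(char) for char in composed]
```

So even a parser that normalizes keys to NFC would still encounter Mn code points inside a key, which is why the categories need to be allowed after the first code point.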

Mark Gillard · Answer 22 · Mon Dec 16 2019 01:57:08 GMT+0800 (China Standard Time)

Alright I've updated the issue text to better reflect the current state of the discussion, as well as including links to my proof-of-concept implementation. I've also updated the script to generate the ABNF notation for the three relevant 'super-categories' of codepoints, which generates this:

; unicode codepoints from categories Ll, Lm, Lo, Lt, Lu
letters = %x41-5A / %x61-7A / %xAA / %xB5 /
        %xBA / %xC0-D6 / %xD8-F6 / %xF8-2C1 /
        %x2C6-2D1 / %x2E0-2E4 / %x2EC / %x2EE /
        %x370-374 / %x376-377 / %x37A-37D / %x37F /
        %x386 / %x388-38A / %x38C / %x38E-3A1 /
        %x3A3-3F5 / %x3F7-481 / %x48A-52F / %x531-556 /
        %x559 / %x560-588 / %x5D0-5EA / %x5EF-5F2 /
        %x620-64A / %x66E-66F / %x671-6D3 / %x6D5 /
        %x6E5-6E6 / %x6EE-6EF / %x6FA-6FC / %x6FF /
        %x710 / %x712-72F / %x74D-7A5 / %x7B1 /
        %x7CA-7EA / %x7F4-7F5 / %x7FA / %x800-815 /
        %x81A / %x824 / %x828 / %x840-858 /
        %x860-86A / %x8A0-8B4 / %x8B6-8C7 / %x904-939 /
        %x93D / %x950 / %x958-961 / %x971-980 /
        %x985-98C / %x98F-990 / %x993-9A8 / %x9AA-9B0 /
        %x9B2 / %x9B6-9B9 / %x9BD / %x9CE /
        %x9DC-9DD / %x9DF-9E1 / %x9F0-9F1 / %x9FC /
        %xA05-A0A / %xA0F-A10 / %xA13-A28 / %xA2A-A30 /
        %xA32-A33 / %xA35-A36 / %xA38-A39 / %xA59-A5C /
        %xA5E / %xA72-A74 / %xA85-A8D / %xA8F-A91 /
        %xA93-AA8 / %xAAA-AB0 / %xAB2-AB3 / %xAB5-AB9 /
        %xABD / %xAD0 / %xAE0-AE1 / %xAF9 /
        %xB05-B0C / %xB0F-B10 / %xB13-B28 / %xB2A-B30 /
        %xB32-B33 / %xB35-B39 / %xB3D / %xB5C-B5D /
        %xB5F-B61 / %xB71 / %xB83 / %xB85-B8A /
        %xB8E-B90 / %xB92-B95 / %xB99-B9A / %xB9C /
        %xB9E-B9F / %xBA3-BA4 / %xBA8-BAA / %xBAE-BB9 /
        %xBD0 / %xC05-C0C / %xC0E-C10 / %xC12-C28 /
        %xC2A-C39 / %xC3D / %xC58-C5A / %xC60-C61 /
        %xC80 / %xC85-C8C / %xC8E-C90 / %xC92-CA8 /
        %xCAA-CB3 / %xCB5-CB9 / %xCBD / %xCDE /
        %xCE0-CE1 / %xCF1-CF2 / %xD04-D0C / %xD0E-D10 /
        %xD12-D3A / %xD3D / %xD4E / %xD54-D56 /
        %xD5F-D61 / %xD7A-D7F / %xD85-D96 / %xD9A-DB1 /
        %xDB3-DBB / %xDBD / %xDC0-DC6 / %xE01-E30 /
        %xE32-E33 / %xE40-E46 / %xE81-E82 / %xE84 /
        %xE86-E8A / %xE8C-EA3 / %xEA5 / %xEA7-EB0 /
        %xEB2-EB3 / %xEBD / %xEC0-EC4 / %xEC6 /
        %xEDC-EDF / %xF00 / %xF40-F47 / %xF49-F6C /
        %xF88-F8C / %x1000-102A / %x103F / %x1050-1055 /
        %x105A-105D / %x1061 / %x1065-1066 / %x106E-1070 /
        %x1075-1081 / %x108E / %x10A0-10C5 / %x10C7 /
        %x10CD / %x10D0-10FA / %x10FC-1248 / %x124A-124D /
        %x1250-1256 / %x1258 / %x125A-125D / %x1260-1288 /
        %x128A-128D / %x1290-12B0 / %x12B2-12B5 / %x12B8-12BE /
        %x12C0 / %x12C2-12C5 / %x12C8-12D6 / %x12D8-1310 /
        %x1312-1315 / %x1318-135A / %x1380-138F / %x13A0-13F5 /
        %x13F8-13FD / %x1401-166C / %x166F-167F / %x1681-169A /
        %x16A0-16EA / %x16F1-16F8 / %x1700-170C / %x170E-1711 /
        %x1720-1731 / %x1740-1751 / %x1760-176C / %x176E-1770 /
        %x1780-17B3 / %x17D7 / %x17DC / %x1820-1878 /
        %x1880-1884 / %x1887-18A8 / %x18AA / %x18B0-18F5 /
        %x1900-191E / %x1950-196D / %x1970-1974 / %x1980-19AB /
        %x19B0-19C9 / %x1A00-1A16 / %x1A20-1A54 / %x1AA7 /
        %x1B05-1B33 / %x1B45-1B4B / %x1B83-1BA0 / %x1BAE-1BAF /
        %x1BBA-1BE5 / %x1C00-1C23 / %x1C4D-1C4F / %x1C5A-1C7D /
        %x1C80-1C88 / %x1C90-1CBA / %x1CBD-1CBF / %x1CE9-1CEC /
        %x1CEE-1CF3 / %x1CF5-1CF6 / %x1CFA / %x1D00-1DBF /
        %x1E00-1F15 / %x1F18-1F1D / %x1F20-1F45 / %x1F48-1F4D /
        %x1F50-1F57 / %x1F59 / %x1F5B / %x1F5D /
        %x1F5F-1F7D / %x1F80-1FB4 / %x1FB6-1FBC / %x1FBE /
        %x1FC2-1FC4 / %x1FC6-1FCC / %x1FD0-1FD3 / %x1FD6-1FDB /
        %x1FE0-1FEC / %x1FF2-1FF4 / %x1FF6-1FFC / %x2071 /
        %x207F / %x2090-209C / %x2102 / %x2107 /
        %x210A-2113 / %x2115 / %x2119-211D / %x2124 /
        %x2126 / %x2128 / %x212A-212D / %x212F-2139 /
        %x213C-213F / %x2145-2149 / %x214E / %x2183-2184 /
        %x2C00-2C2E / %x2C30-2C5E / %x2C60-2CE4 / %x2CEB-2CEE /
        %x2CF2-2CF3 / %x2D00-2D25 / %x2D27 / %x2D2D /
        %x2D30-2D67 / %x2D6F / %x2D80-2D96 / %x2DA0-2DA6 /
        %x2DA8-2DAE / %x2DB0-2DB6 / %x2DB8-2DBE / %x2DC0-2DC6 /
        %x2DC8-2DCE / %x2DD0-2DD6 / %x2DD8-2DDE / %x2E2F /
        %x3005-3006 / %x3031-3035 / %x303B-303C / %x3041-3096 /
        %x309D-309F / %x30A1-30FA / %x30FC-30FF / %x3105-312F /
        %x3131-318E / %x31A0-31BF / %x31F0-31FF / %x3400-4DBF /
        %x4E00-9FFC / %xA000-A48C / %xA4D0-A4FD / %xA500-A60C /
        %xA610-A61F / %xA62A-A62B / %xA640-A66E / %xA67F-A69D /
        %xA6A0-A6E5 / %xA717-A71F / %xA722-A788 / %xA78B-A7BF /
        %xA7C2-A7CA / %xA7F5-A801 / %xA803-A805 / %xA807-A80A /
        %xA80C-A822 / %xA840-A873 / %xA882-A8B3 / %xA8F2-A8F7 /
        %xA8FB / %xA8FD-A8FE / %xA90A-A925 / %xA930-A946 /
        %xA960-A97C / %xA984-A9B2 / %xA9CF / %xA9E0-A9E4 /
        %xA9E6-A9EF / %xA9FA-A9FE / %xAA00-AA28 / %xAA40-AA42 /
        %xAA44-AA4B / %xAA60-AA76 / %xAA7A / %xAA7E-AAAF /
        %xAAB1 / %xAAB5-AAB6 / %xAAB9-AABD / %xAAC0 /
        %xAAC2 / %xAADB-AADD / %xAAE0-AAEA / %xAAF2-AAF4 /
        %xAB01-AB06 / %xAB09-AB0E / %xAB11-AB16 / %xAB20-AB26 /
        %xAB28-AB2E / %xAB30-AB5A / %xAB5C-AB69 / %xAB70-ABE2 /
        %xAC00-D7A3 / %xD7B0-D7C6 / %xD7CB-D7FB / %xF900-FA6D /
        %xFA70-FAD9 / %xFB00-FB06 / %xFB13-FB17 / %xFB1D /
        %xFB1F-FB28 / %xFB2A-FB36 / %xFB38-FB3C / %xFB3E /
        %xFB40-FB41 / %xFB43-FB44 / %xFB46-FBB1 / %xFBD3-FD3D /
        %xFD50-FD8F / %xFD92-FDC7 / %xFDF0-FDFB / %xFE70-FE74 /
        %xFE76-FEFC / %xFF21-FF3A / %xFF41-FF5A / %xFF66-FFBE /
        %xFFC2-FFC7 / %xFFCA-FFCF / %xFFD2-FFD7 / %xFFDA-FFDC /
        %x10000-1000B / %x1000D-10026 / %x10028-1003A / %x1003C-1003D /
        %x1003F-1004D / %x10050-1005D / %x10080-100FA / %x10280-1029C /
        %x102A0-102D0 / %x10300-1031F / %x1032D-10340 / %x10342-10349 /
        %x10350-10375 / %x10380-1039D / %x103A0-103C3 / %x103C8-103CF /
        %x10400-1049D / %x104B0-104D3 / %x104D8-104FB / %x10500-10527 /
        %x10530-10563 / %x10600-10736 / %x10740-10755 / %x10760-10767 /
        %x10800-10805 / %x10808 / %x1080A-10835 / %x10837-10838 /
        %x1083C / %x1083F-10855 / %x10860-10876 / %x10880-1089E /
        %x108E0-108F2 / %x108F4-108F5 / %x10900-10915 / %x10920-10939 /
        %x10980-109B7 / %x109BE-109BF / %x10A00 / %x10A10-10A13 /
        %x10A15-10A17 / %x10A19-10A35 / %x10A60-10A7C / %x10A80-10A9C /
        %x10AC0-10AC7 / %x10AC9-10AE4 / %x10B00-10B35 / %x10B40-10B55 /
        %x10B60-10B72 / %x10B80-10B91 / %x10C00-10C48 / %x10C80-10CB2 /
        %x10CC0-10CF2 / %x10D00-10D23 / %x10E80-10EA9 / %x10EB0-10EB1 /
        %x10F00-10F1C / %x10F27 / %x10F30-10F45 / %x10FB0-10FC4 /
        %x10FE0-10FF6 / %x11003-11037 / %x11083-110AF / %x110D0-110E8 /
        %x11103-11126 / %x11144 / %x11147 / %x11150-11172 /
        %x11176 / %x11183-111B2 / %x111C1-111C4 / %x111DA /
        %x111DC / %x11200-11211 / %x11213-1122B / %x11280-11286 /
        %x11288 / %x1128A-1128D / %x1128F-1129D / %x1129F-112A8 /
        %x112B0-112DE / %x11305-1130C / %x1130F-11310 / %x11313-11328 /
        %x1132A-11330 / %x11332-11333 / %x11335-11339 / %x1133D /
        %x11350 / %x1135D-11361 / %x11400-11434 / %x11447-1144A /
        %x1145F-11461 / %x11480-114AF / %x114C4-114C5 / %x114C7 /
        %x11580-115AE / %x115D8-115DB / %x11600-1162F / %x11644 /
        %x11680-116AA / %x116B8 / %x11700-1171A / %x11800-1182B /
        %x118A0-118DF / %x118FF-11906 / %x11909 / %x1190C-11913 /
        %x11915-11916 / %x11918-1192F / %x1193F / %x11941 /
        %x119A0-119A7 / %x119AA-119D0 / %x119E1 / %x119E3 /
        %x11A00 / %x11A0B-11A32 / %x11A3A / %x11A50 /
        %x11A5C-11A89 / %x11A9D / %x11AC0-11AF8 / %x11C00-11C08 /
        %x11C0A-11C2E / %x11C40 / %x11C72-11C8F / %x11D00-11D06 /
        %x11D08-11D09 / %x11D0B-11D30 / %x11D46 / %x11D60-11D65 /
        %x11D67-11D68 / %x11D6A-11D89 / %x11D98 / %x11EE0-11EF2 /
        %x11FB0 / %x12000-12399 / %x12480-12543 / %x13000-1342E /
        %x14400-14646 / %x16800-16A38 / %x16A40-16A5E / %x16AD0-16AED /
        %x16B00-16B2F / %x16B40-16B43 / %x16B63-16B77 / %x16B7D-16B8F /
        %x16E40-16E7F / %x16F00-16F4A / %x16F50 / %x16F93-16F9F /
        %x16FE0-16FE1 / %x16FE3 / %x17000-187F7 / %x18800-18CD5 /
        %x18D00-18D08 / %x1B000-1B11E / %x1B150-1B152 / %x1B164-1B167 /
        %x1B170-1B2FB / %x1BC00-1BC6A / %x1BC70-1BC7C / %x1BC80-1BC88 /
        %x1BC90-1BC99 / %x1D400-1D454 / %x1D456-1D49C / %x1D49E-1D49F /
        %x1D4A2 / %x1D4A5-1D4A6 / %x1D4A9-1D4AC / %x1D4AE-1D4B9 /
        %x1D4BB / %x1D4BD-1D4C3 / %x1D4C5-1D505 / %x1D507-1D50A /
        %x1D50D-1D514 / %x1D516-1D51C / %x1D51E-1D539 / %x1D53B-1D53E /
        %x1D540-1D544 / %x1D546 / %x1D54A-1D550 / %x1D552-1D6A5 /
        %x1D6A8-1D6C0 / %x1D6C2-1D6DA / %x1D6DC-1D6FA / %x1D6FC-1D714 /
        %x1D716-1D734 / %x1D736-1D74E / %x1D750-1D76E / %x1D770-1D788 /
        %x1D78A-1D7A8 / %x1D7AA-1D7C2 / %x1D7C4-1D7CB / %x1E100-1E12C /
        %x1E137-1E13D / %x1E14E / %x1E2C0-1E2EB / %x1E800-1E8C4 /
        %x1E900-1E943 / %x1E94B / %x1EE00-1EE03 / %x1EE05-1EE1F /
        %x1EE21-1EE22 / %x1EE24 / %x1EE27 / %x1EE29-1EE32 /
        %x1EE34-1EE37 / %x1EE39 / %x1EE3B / %x1EE42 /
        %x1EE47 / %x1EE49 / %x1EE4B / %x1EE4D-1EE4F /
        %x1EE51-1EE52 / %x1EE54 / %x1EE57 / %x1EE59 /
        %x1EE5B / %x1EE5D / %x1EE5F / %x1EE61-1EE62 /
        %x1EE64 / %x1EE67-1EE6A / %x1EE6C-1EE72 / %x1EE74-1EE77 /
        %x1EE79-1EE7C / %x1EE7E / %x1EE80-1EE89 / %x1EE8B-1EE9B /
        %x1EEA1-1EEA3 / %x1EEA5-1EEA9 / %x1EEAB-1EEBB / %x20000-2A6DD /
        %x2A700-2B734 / %x2B740-2B81D / %x2B820-2CEA1 / %x2CEB0-2EBE0 /
        %x2F800-2FA1D / %x30000-3134A
        ; 131241 codepoints in total


; unicode codepoints from categories Nd, Nl
numbers = %x30-39 / %x660-669 / %x6F0-6F9 / %x7C0-7C9 /
        %x966-96F / %x9E6-9EF / %xA66-A6F / %xAE6-AEF /
        %xB66-B6F / %xBE6-BEF / %xC66-C6F / %xCE6-CEF /
        %xD66-D6F / %xDE6-DEF / %xE50-E59 / %xED0-ED9 /
        %xF20-F29 / %x1040-1049 / %x1090-1099 / %x16EE-16F0 /
        %x17E0-17E9 / %x1810-1819 / %x1946-194F / %x19D0-19D9 /
        %x1A80-1A89 / %x1A90-1A99 / %x1B50-1B59 / %x1BB0-1BB9 /
        %x1C40-1C49 / %x1C50-1C59 / %x2160-2182 / %x2185-2188 /
        %x3007 / %x3021-3029 / %x3038-303A / %xA620-A629 /
        %xA6E6-A6EF / %xA8D0-A8D9 / %xA900-A909 / %xA9D0-A9D9 /
        %xA9F0-A9F9 / %xAA50-AA59 / %xABF0-ABF9 / %xFF10-FF19 /
        %x10140-10174 / %x10341 / %x1034A / %x103D1-103D5 /
        %x104A0-104A9 / %x10D30-10D39 / %x11066-1106F / %x110F0-110F9 /
        %x11136-1113F / %x111D0-111D9 / %x112F0-112F9 / %x11450-11459 /
        %x114D0-114D9 / %x11650-11659 / %x116C0-116C9 / %x11730-11739 /
        %x118E0-118E9 / %x11950-11959 / %x11C50-11C59 / %x11D50-11D59 /
        %x11DA0-11DA9 / %x12400-1246E / %x16A60-16A69 / %x16B50-16B59 /
        %x1D7CE-1D7FF / %x1E140-1E149 / %x1E2F0-1E2F9 / %x1E950-1E959 /
        %x1FBF0-1FBF9
        ; 886 codepoints in total


; unicode codepoints from categories Mn, Mc
combining_marks = %x300-36F / %x483-487 / %x591-5BD / %x5BF /
        %x5C1-5C2 / %x5C4-5C5 / %x5C7 / %x610-61A /
        %x64B-65F / %x670 / %x6D6-6DC / %x6DF-6E4 /
        %x6E7-6E8 / %x6EA-6ED / %x711 / %x730-74A /
        %x7A6-7B0 / %x7EB-7F3 / %x7FD / %x816-819 /
        %x81B-823 / %x825-827 / %x829-82D / %x859-85B /
        %x8D3-8E1 / %x8E3-903 / %x93A-93C / %x93E-94F /
        %x951-957 / %x962-963 / %x981-983 / %x9BC /
        %x9BE-9C4 / %x9C7-9C8 / %x9CB-9CD / %x9D7 /
        %x9E2-9E3 / %x9FE / %xA01-A03 / %xA3C /
        %xA3E-A42 / %xA47-A48 / %xA4B-A4D / %xA51 /
        %xA70-A71 / %xA75 / %xA81-A83 / %xABC /
        %xABE-AC5 / %xAC7-AC9 / %xACB-ACD / %xAE2-AE3 /
        %xAFA-AFF / %xB01-B03 / %xB3C / %xB3E-B44 /
        %xB47-B48 / %xB4B-B4D / %xB55-B57 / %xB62-B63 /
        %xB82 / %xBBE-BC2 / %xBC6-BC8 / %xBCA-BCD /
        %xBD7 / %xC00-C04 / %xC3E-C44 / %xC46-C48 /
        %xC4A-C4D / %xC55-C56 / %xC62-C63 / %xC81-C83 /
        %xCBC / %xCBE-CC4 / %xCC6-CC8 / %xCCA-CCD /
        %xCD5-CD6 / %xCE2-CE3 / %xD00-D03 / %xD3B-D3C /
        %xD3E-D44 / %xD46-D48 / %xD4A-D4D / %xD57 /
        %xD62-D63 / %xD81-D83 / %xDCA / %xDCF-DD4 /
        %xDD6 / %xDD8-DDF / %xDF2-DF3 / %xE31 /
        %xE34-E3A / %xE47-E4E / %xEB1 / %xEB4-EBC /
        %xEC8-ECD / %xF18-F19 / %xF35 / %xF37 /
        %xF39 / %xF3E-F3F / %xF71-F84 / %xF86-F87 /
        %xF8D-F97 / %xF99-FBC / %xFC6 / %x102B-103E /
        %x1056-1059 / %x105E-1060 / %x1062-1064 / %x1067-106D /
        %x1071-1074 / %x1082-108D / %x108F / %x109A-109D /
        %x135D-135F / %x1712-1714 / %x1732-1734 / %x1752-1753 /
        %x1772-1773 / %x17B4-17D3 / %x17DD / %x180B-180D /
        %x1885-1886 / %x18A9 / %x1920-192B / %x1930-193B /
        %x1A17-1A1B / %x1A55-1A5E / %x1A60-1A7C / %x1A7F /
        %x1AB0-1ABD / %x1ABF-1AC0 / %x1B00-1B04 / %x1B34-1B44 /
        %x1B6B-1B73 / %x1B80-1B82 / %x1BA1-1BAD / %x1BE6-1BF3 /
        %x1C24-1C37 / %x1CD0-1CD2 / %x1CD4-1CE8 / %x1CED /
        %x1CF4 / %x1CF7-1CF9 / %x1DC0-1DF9 / %x1DFB-1DFF /
        %x20D0-20DC / %x20E1 / %x20E5-20F0 / %x2CEF-2CF1 /
        %x2D7F / %x2DE0-2DFF / %x302A-302F / %x3099-309A /
        %xA66F / %xA674-A67D / %xA69E-A69F / %xA6F0-A6F1 /
        %xA802 / %xA806 / %xA80B / %xA823-A827 /
        %xA82C / %xA880-A881 / %xA8B4-A8C5 / %xA8E0-A8F1 /
        %xA8FF / %xA926-A92D / %xA947-A953 / %xA980-A983 /
        %xA9B3-A9C0 / %xA9E5 / %xAA29-AA36 / %xAA43 /
        %xAA4C-AA4D / %xAA7B-AA7D / %xAAB0 / %xAAB2-AAB4 /
        %xAAB7-AAB8 / %xAABE-AABF / %xAAC1 / %xAAEB-AAEF /
        %xAAF5-AAF6 / %xABE3-ABEA / %xABEC-ABED / %xFB1E /
        %xFE00-FE0F / %xFE20-FE2F / %x101FD / %x102E0 /
        %x10376-1037A / %x10A01-10A03 / %x10A05-10A06 / %x10A0C-10A0F /
        %x10A38-10A3A / %x10A3F / %x10AE5-10AE6 / %x10D24-10D27 /
        %x10EAB-10EAC / %x10F46-10F50 / %x11000-11002 / %x11038-11046 /
        %x1107F-11082 / %x110B0-110BA / %x11100-11102 / %x11127-11134 /
        %x11145-11146 / %x11173 / %x11180-11182 / %x111B3-111C0 /
        %x111C9-111CC / %x111CE-111CF / %x1122C-11237 / %x1123E /
        %x112DF-112EA / %x11300-11303 / %x1133B-1133C / %x1133E-11344 /
        %x11347-11348 / %x1134B-1134D / %x11357 / %x11362-11363 /
        %x11366-1136C / %x11370-11374 / %x11435-11446 / %x1145E /
        %x114B0-114C3 / %x115AF-115B5 / %x115B8-115C0 / %x115DC-115DD /
        %x11630-11640 / %x116AB-116B7 / %x1171D-1172B / %x1182C-1183A /
        %x11930-11935 / %x11937-11938 / %x1193B-1193E / %x11940 /
        %x11942-11943 / %x119D1-119D7 / %x119DA-119E0 / %x119E4 /
        %x11A01-11A0A / %x11A33-11A39 / %x11A3B-11A3E / %x11A47 /
        %x11A51-11A5B / %x11A8A-11A99 / %x11C2F-11C36 / %x11C38-11C3F /
        %x11C92-11CA7 / %x11CA9-11CB6 / %x11D31-11D36 / %x11D3A /
        %x11D3C-11D3D / %x11D3F-11D45 / %x11D47 / %x11D8A-11D8E /
        %x11D90-11D91 / %x11D93-11D97 / %x11EF3-11EF6 / %x16AF0-16AF4 /
        %x16B30-16B36 / %x16F4F / %x16F51-16F87 / %x16F8F-16F92 /
        %x16FE4 / %x16FF0-16FF1 / %x1BC9D-1BC9E / %x1D165-1D169 /
        %x1D16D-1D172 / %x1D17B-1D182 / %x1D185-1D18B / %x1D1AA-1D1AD /
        %x1D242-1D244 / %x1DA00-1DA36 / %x1DA3B-1DA6C / %x1DA75 /
        %x1DA84 / %x1DA9B-1DA9F / %x1DAA1-1DAAF / %x1E000-1E006 /
        %x1E008-1E018 / %x1E01B-1E021 / %x1E023-1E024 / %x1E026-1E02A /
        %x1E130-1E136 / %x1E2EC-1E2EF / %x1E8D0-1E8D6 / %x1E944-1E94A /
        %xE0100-E01EF
        ; 2282 codepoints in total
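For readers who want to reproduce listings like these, the following is a rough sketch of how such ranges could be generated with Python's `unicodedata`. It is not the script referenced above; `category_ranges` and `to_abnf` are hypothetical helpers, and the output depends on the Unicode version shipped with the interpreter, so it may not match the listing exactly.

```python
import sys
import unicodedata


def category_ranges(categories):
    """Collect contiguous code point ranges whose general category is in `categories`."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        inside = unicodedata.category(chr(cp)) in categories
        if inside and start is None:
            start = cp
        elif not inside and start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


def to_abnf(name, ranges):
    """Format ranges in the %xA-B style used in the listing above."""
    parts = [f"%x{lo:X}" if lo == hi else f"%x{lo:X}-{hi:X}" for lo, hi in ranges]
    return f"{name} = " + " / ".join(parts)


numbers = category_ranges({"Nd", "Nl"})
```

Calling `to_abnf("numbers", numbers)` then yields a single-line rule; wrapping it to four ranges per line is a formatting detail left out here.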

Christian Siefkes · Answer 23 · Mon Dec 16 2019 21:00:59 GMT+0800 (China Standard Time)

@marzer Great!

@pradyunsg Assuming that one of us prepares a PR, is there any chance that this would be merged relatively quickly? Or does it have to wait until 1.0 is released in any case?

Alexey Sergin · Answer 24 · Mon Dec 16 2019 21:33:47 GMT+0800 (China Standard Time)

This feels like a significant change to TOML's interpretation of being "minimal". Maybe we should ask Tom himself to bless this change?

Mark Gillard · Answer 25 · Mon Dec 16 2019 21:49:18 GMT+0800 (China Standard Time)

Is it though? The language itself will be just as minimal as before, since this change will be backwards-compatible. In fact it would actually increase the simplicity of TOML files since keys should work in a WYSIWYG way for more people, and only require quotes in very specific circumstances.

It will complicate it for implementers, sure, but not all that much.

Alexey Sergin · Answer 26 · Mon Dec 16 2019 22:04:36 GMT+0800 (China Standard Time)

@marzer, I have no objections to the Unicode version of this proposal. However, I do feel that the change is too significant to apply without asking Tom.

Dave Ostroske · Answer 27 · Tue Dec 17 2019 11:23:58 GMT+0800 (China Standard Time)

I can't help but think that there must be a simpler way of expressing all this. Sorry to be a wet blanket, especially after the great work that everyone here's put into making this happen.

ABNF wasn't defined for use with Unicode. The Unicode folks offer guidelines for adding Unicode support to regex engines, including character classes, but ABNF as it stands doesn't provide a way to use character classes like this.

What would it take to augment ABNF with general Unicode categories? Something like \p{Lu}, or \p{Uppercase Letter}, would make toml.abnf readable, not to mention compatible with newer Unicode versions as they are released.

Or we could just cop-out and do what Scala did, but in ABNF.

upper            ::=  ‘A’ | … | ‘Z’ | ‘$’ | ‘_’  // and Unicode category Lu
lower            ::=  ‘a’ | … | ‘z’ // and Unicode category Ll
letter           ::=  upper | lower // and Unicode categories Lo, Lt, Nl

I'm fussing over this because I'm not keen on suddenly doubling the size of our ABNF file with a collection of character sets that are better defined somewhere else.

Christian Siefkes · Answer 28 · Wed Dec 18 2019 01:22:57 GMT+0800 (China Standard Time)

@eksortso: That's an interesting idea – I agree that it would be much nicer to somehow reference the Unicode categories in the ABNF file. Assuming the regex syntax were adopted as a kind of unofficial ABNF extension, the proposed new syntax for unquoted keys could be expressed very succinctly:

unquoted-key = ukey-start *ukey-continued
ukey-start = \p{L} / \p{Nd} / \p{Nl} / "_" / "-"
ukey-continued = ukey-start / \p{Mc} / \p{Mn}

An alternative, more in line with how ABNF generally works, is to assume a series of predefined UNICODE_CATEGORY_X and UNICODE_CATEGORY_XX rules, complementing the ASCII-only core rules that are already defined by the standard (such as ALPHA, DIGIT, HEXDIG). This would allow us to write

unquoted-key = ukey-start *ukey-continued
ukey-start = UNICODE_CATEGORY_L / UNICODE_CATEGORY_ND / UNICODE_CATEGORY_NL / "_" / "-"
ukey-continued = ukey-start / UNICODE_CATEGORY_MC / UNICODE_CATEGORY_MN

This would be a syntactically valid ABNF file, but, of course, the parser would complain about undefined rules.

To possibly address this, we could add a second ABNF, unicode_categories.abnf, autogenerated by a variant of @marzer 's script and re-generated whenever a new version of Unicode has been released (we should add the script to the TOML repository as well). In theory, it could list all categories, but for practical purposes, it's probably better to just generate and list the handful of rules that are actually used in the toml.abnf.

Anyone who wants to actually use the ABNF for some purpose can then concatenate the two files before handing them over to the ABNF parser. (I haven't seen any kind of "include" command in ABNF.) Those who just want to understand the details of TOML's syntax model can read the toml.abnf without having to wade through all the Unicode stuff.

Abel Braaksma · Answer 29 · Thu Dec 19 2019 21:30:12 GMT+0800 (China Standard Time)

After reading this discussion, it seems to start to boil down to what looks fairly close to what is allowed in XML names, which in turn is what is allowed in HTML tag-names and @id/@name targets, which are fairly well understood.

I would, therefore, like to propose to simply take the definition of NmToken from the XML specification. The difference from Name is that it allows starting with digits, which is allowed in TOML.

Here's the EBNF (not ABNF, but it is easy to translate/understand), which is pretty close to earlier suggestions above (the long lists in the posts), but with the addition of unassigned ranges. See the argument below for that.

/* interpreted from NameStartChar and NameChar */
NameChar := "-" | "." | [0-9] | ":"                ; we should remove colon and dot
            | [A-Z] | "_" | [a-z] | #xB7           ; xB7 is MIDDLE DOT
            | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF]
            | [#x300-#x37D] | [#x37F-#x1FFF]       ; this excludes GREEK QUESTION MARK 
            | [#x200C-#x200D] | [#x203F-#x2040] | [#x2070-#x218F] 
            | [#x2C00-#x2FEF] | [#x3001-#xD7FF]
            | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] 
            | [#x10000-#xEFFFF]

NmToken := (NameChar)+

We should probably remove the "." and the colon ":" from the definition
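A range-based check along these lines needs no Unicode category tables at all. The sketch below copies the ranges from the production above into Python and leaves out "." and ":" as suggested; MIDDLE DOT (U+00B7) is kept as in the production. The names `NAME_CHAR_RANGES`, `is_name_char` and `is_bare_key_xml_style` are hypothetical, not part of any XML or TOML library.

```python
# Ranges copied from the NameChar production above, without "." and ":".
NAME_CHAR_RANGES = [
    (0x2D, 0x2D), (0x30, 0x39), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xB7, 0xB7), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x300, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x203F, 0x2040),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_name_char(ch: str) -> bool:
    """True if the code point falls into one of the fixed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_CHAR_RANGES)


def is_bare_key_xml_style(key: str) -> bool:
    """NmToken-style check: one or more NameChar code points (hypothetical helper)."""
    return bool(key) and all(is_name_char(ch) for ch in key)
```

Because the ranges are fixed, the result does not change with the Unicode version of the runtime; the trade-off is that unassigned code points and some punctuation inside the ranges are accepted as well.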

Motivation

I can see a bunch of advantages for copying (most of) this definition:

Many people are accustomed to HTML and/or XML or have seen #name identifiers; they all obey these rules, which helps adoption.
It follows the well-known adage: be liberal in what you accept.
Most implementing languages have XML support and therefore, have a function to parse a name (i.e., in the .NET world that is XName).
- Take your favorite library's XML Name parser
- Check for dot and colon (not allowed)
- Or just manually check for the ranges above (also trivial)
The definition is fairly close to the earlier referenced RFC-3987, though it is more permissive
It is resilient against normalization
Having a handful of ranges is much simpler than the complex lists of the Unicode Categories, and makes it easier to support TOML in languages that have a minimal understanding of Unicode and/or buggy Categories support (.NET, for instance, has many issues in different versions related to this).
It explicitly disallows x7B - xBF (allowed in comments above), because these ASCII ranges are used as control characters on many systems (also, there's little in there that really warrants inclusion anyway)
The inclusion of unassigned characters allows TOML to be future-proof
Being independent of Unicode categories has the advantage of being independent of Unicode versions. Unfortunately, the Unicode consortium has re-assigned and moved characters from one category to another. Proposals like those above that depend on categories are therefore subject to whatever version of Unicode your implementation supports, which can undoubtedly lead to ambiguities
The definition above excludes "the ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters"
It excludes x037E, GREEK QUESTION MARK, because it gets normalized to a semicolon (do note that @marzer also excluded this).

I personally find the ease of implementation and the non-ambiguous future- and history-proof fixed ranges to be a large advantage.

Motivation in XML's words

Copying here so you don't have to go to the XML spec, but I found these arguments pretty conclusive and clear:

Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See J Suggestions for XML Names for suggestions on the creation of names.

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

(we should remove COLON and FULL STOP, obviously)

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name.

Christian Siefkes · Answer 30 · Fri Dec 20 2019 22:41:35 GMT+0800 (China Standard Time)

@abelbraaksma That's an interesting idea. It would certainly simplify ABNF handling a lot. Nevertheless, conceptually I find

"arbitrary Unicode letters and digits as well as ASCII underscores (_) and dashes (-)"

much clearer than

"almost an XMLName, except that leading digits are allowed and a few characters have been excluded"

Everybody knows what letters and digits are, even though nobody will know all the letters that exist in various scripts. But who knows what exactly is and is not allowed in an XMLName? I suspect not even the creators of XML (without looking it up).

Also, since they don't know which Unicode characters are marks (combining characters) and which are not (see discussion above), it seems that they allow an XMLName to start with a mark, i.e. a character that graphically fuses with and modifies the character that precedes it – in front of the XMLName. Of course, well-behaving authors won't use such names, but it's still a bit odd.

Also, while I didn't check exact ranges, I suppose that most of Close Punctuation – some of which looks quite similar to the square brackets which terminate a table name – and Other Punctuation – some of which looks quite similar to the dots separating key parts in TOML – is allowed. Again, people obviously don't have to use such dubious characters, but I still would prefer not to allow them in bare keys.

A final minor note: If we go down that route (I'm not strictly against it, though I have my reservations as noted), we should certainly exclude the middle dot as well – since it's a singleton in the listed ranges, that's trivial to do.

Abel Braaksma · Answer 31 · Sat Dec 21 2019 01:57:34 GMT+0800 (China Standard Time)

@ChristianSi, thanks for even considering it :). Allow me to try to address your concerns from my viewpoint.

much clearer than

It is a matter of communication. Your first statement is indeed clearer. A common way to refer to the XML name production is to say "Any Unicode letter, letterlike character or digit, except dot".

The precise production is not relevant; it is permissive by design, and follows the "principle of least surprise" for users. However you write a word in your language or writing system, it is allowed as a keyword.

nobody will know all the letters that exist in various scripts

Indeed, but that is true for any proposal in this thread so far. But is that relevant? All you need to know is a subset, and the rule is that you can use whatever subset you are comfortable with in your preferred language.

I suspect not even the creators of XML (without looking it up).

I am part of the group that maintained XML, and we all have to look it up from time to time. But, when writing code, we never have to bother to look it up: what we feel as correct, usually is syntactically correct as well.

since they don't know which Unicode characters are marks (combining characters) and which are not it seems that they allow an XMLName to start with a mark

A bit yes and no. I accidentally copied the main diacritical marks section [#x300-#x37D] to the production above, but they are not allowed as start character in XML and we should probably disallow it as well.

Since we cannot possibly predict the natural language used by programmers (be it from China, Taiwan, Swaziland, or even abugida languages like Thai or Bengali), we shouldn't limit the ability to write in their native tongue. If you are Tamil and you have to write without diacritics (and other combining marks) it becomes illegible. But none of these languages will start with a combining mark, so that is safe to exclude.

a character that graphically fuses with and modifies the character that precedes it – in front of the XMLName.

While true, we should not consider what "graphically fuses", as that is a subject about rendering Unicode fonts. How it "looks" is irrelevant for how it parses. A parser operates on codepoints, not graphemes, clusters or byte sequences.

We should, however, make a note on normalization (regardless of what Unicode subset we're going to allow). Consider ñaña, a perfectly valid word in Spanish. The first letter can be written as 0x6E 0x303 (ñ, two codepoints, n followed by a combining mark) or as 0xF1 (ñ, one codepoint). These two render slightly differently, but should mean the same. Another example is in Dutch (my language), where ij (two codepoints i and j) means the same as ĳ (0x133, one codepoint). Whether or not we should say so in the spec, I would suggest that parsers follow NFC (or NFD), so that the chosen editor will not affect behavior.
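The effect can be checked with Python's `unicodedata.normalize`. The snippet below is only an illustration with made-up variable names. One detail it surfaces: the ĳ ligature has a compatibility decomposition rather than a canonical one, so it is folded to "ij" by NFKC but left untouched by NFC.

```python
import unicodedata

decomposed = "n\u0303"   # n + COMBINING TILDE
precomposed = "\u00f1"   # LATIN SMALL LETTER N WITH TILDE

# The raw strings differ, but their NFC forms are identical.
raw_equal = decomposed == precomposed
nfc_equal = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", precomposed))

ligature = "\u0133"      # LATIN SMALL LIGATURE IJ
nfc_ij = unicodedata.normalize("NFC", ligature)    # stays U+0133
nfkc_ij = unicodedata.normalize("NFKC", ligature)  # becomes "ij"
```

A parser that compares keys after NFC would therefore treat both spellings of ñ as the same key, while the two Dutch spellings would only coincide under a compatibility form such as NFKC.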

An alternative is to make normalization not a mandatory thing, but just to make a note in the docs that says that it may affect identifiers. This is, btw, true for any programming (and non-programming) language out there and JSON is also not immune to normalization issues.

I suppose that most of Close Punctuation – some of which looks quite similar to the square brackets which terminate a table name – and Other Punctuation

Some are included and some aren't. The ones you find on your keyboard are definitely excluded. The ones that most resemble often-used closing/opening punctuation are also excluded (like 0x5D, or ], and 0x27E9, or ⟩).

I checked further, and indeed, with this rule in place, this is valid: footnote₍₁₎, and this too: Yig-༒-FurSat-༆ . But is that problematic? If authors really want it, they can (and these ranges contain valid characters in some languages), but generally, I think they'll stay away from it anyway. It follows my personal favorite: be liberal in what you accept.

Again, people obviously don't have to use such dubious characters, but I still would prefer not to allow them in bare keys.

That's almost a philosophical question and much debate has preceded the current one (I've been a member of the XML working group at W3.org for 8 years). In .NET, the CLR accepts any string as identifier. But C# chose a more restrictive set. F#, in contrast, allows (almost) any character.

The more restrictive you are, the harder it is to learn what is allowed and why. Personally, I find the ability to write something that looks like, but is not quite, a separating token in TOML a strong argument in favor of allowing this. If you don't want to complicate things in your own scripts, that's good, but should you disallow it for others? They'll always try to find a way to express clearly what they want.

For instance, I like to use the @-sign in keys and identifiers, because it has special meaning in my context (specifies that it applies to attributes). However, the normal @-sign is disallowed in my language of choice, F#. So I use ＠ (0xFF20), which makes my code more readable than writing at-Something.

If the F# designers had the same idea as you: if it looks like something we currently disallow, then we should disallow it altogether, then they'd have to disallow anything that looks like reserved characters in their view. And then we haven't even discussed ligatures: we can always make something look like something forbidden.

we should certainly exclude the middle dot as well

I have no strong opinion one way or the other.

In conclusion: sorry for the long post ;). There's three more things I'd like to add, though:

Being permissive is usually better than being restrictive in these cases
Speed: this parses much, much faster than character classes or categories
If you do choose to adopt the XMLNames syntax (or a variant), then you get as a bonus that TOML will be roundtrippable to XML, where identifiers can be used as element names. Maybe not your primary goal, but it's a nice extra ;)

Christian Siefkes · Answer 32 · Tue Dec 24 2019 00:44:33 GMT+0800 (China Standard Time)

@abelbraaksma: You make a strong case – consider me semi-convinced. For me, the important thing is that people should be able to use words in arbitrary languages as a kind of "bareword", without having to quote them. Whether we realize that goal by allowing "arbitrary Unicode letters and digits" or XMLNames is less important – both would work. Hence it might more or less boil down to a question of what's easier for implementers. I wonder what others here think?

Regarding ranges full of marks, you mention that "they are not allowed as start character in XML and we should probably disallow it as well." Maybe you could investigate that further? It would be interesting to see which ranges are prohibited in NameStartChar, but allowed in NameChar – and why. If they are full of non-ASCII digits, we might want to allow them even at the start, but if they are full of marks, we certainly don't.

If you do choose to adopt the XMLNames syntax (or a variant), then you get as a bonus that TOML will be roundtrippable to XML, where identifiers can be used as element names.

Not all, since quoted keys (which allow arbitrary characters) are legal as well and will not go away. But in any case I think we agree that that's not the main issue.

Normalization is an altogether different issue, since it also concerns quoted keys, so it is independent of the question of whether or not additional characters are allowed in the unquoted variety. Maybe you could open a new issue to discuss it?

Mark Gillard · Answer 33 · Tue Dec 24 2019 01:50:41 GMT+0800 (China Standard Time)

@ChristianSi

Hence it might more or less boil down to a question of what's easier for implementers. I wonder what others here think?

Having just implemented the unicode group-based approach in my parser I can attest that the more permissive set of character ranges suggested by @abelbraaksma would have been easier, though now that the work is done I don't need to revisit it (except to incorporate updates to the Unicode standard, but that amounts to running a script once).

@abelbraaksma's point about varying levels of correctness among other unicode implementations is a good one, though.

Jesse Johnson · Answer 34 · Tue Feb 18 2020 05:08:59 GMT+0800 (China Standard Time)

I'm coming to this as someone who is incorporating TOML into a project with keys that will often contain symbols/punctuation. I've read through this thread and I have not seen anyone propose that keys allow any valid unicode except the symbols needed by the TOML parser itself. That would:

maximize permissiveness
create an easy and small set of rules for when people need to quote keys (if the key contains space, dot, brackets, quotes, etc, then it must be quoted)
not strictly require a unicode library for a parser
likely be faster to parse than other options

I'm not currently arguing this is the best approach but it seemed worth adding to the set of options in the discussion space.
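A minimal sketch of the "everything except syntax characters" rule described in this comment, for illustration only. The `RESERVED` set and the `needs_quoting` helper are hypothetical; the thread only names "space, dot, brackets, quotes, etc", so the exact list below is an assumption.

```python
# Characters assumed to have a syntactic role in TOML. This exact set is an
# illustrative assumption, not a list taken from the spec or from this thread.
RESERVED = set(".[]{}\"'=#,")


def needs_quoting(key: str) -> bool:
    """Return True if `key` could not be written as a bare key under this rule."""
    if not key:
        return True  # an empty key can only be written in quoted form
    # Any reserved symbol or whitespace character forces quoting.
    return any(ch in RESERVED or ch.isspace() for ch in key)
```

Under this sketch, `needs_quoting("user@host")` is false while `needs_quoting("a.b")` is true. It needs no Unicode tables, which matches the "not strictly require a unicode library" point, but it also lets invisible and control characters through unless they are added to the reserved set.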

Mark Gillard · Answer 35 · Sun Feb 23 2020 03:56:33 GMT+0800 (China Standard Time)

If anyone is interested in playing around with a parser that supports this tentative feature (as specified in the OP, anyways), my C++ TOML library is now in a publishable state: https://marzer.github.io/tomlplusplus/

@thoughtafter It seems as though your suggestion is very much in-line with @abelbraaksma's (which from my reading, advocates including everything except syntactically-relevant/ambiguous characters).

龙腾道 · Answer 36 · Sun Apr 12 2020 05:37:13 GMT+0800 (China Standard Time)

In my opinion, we don't program in any natural language, including English. What we write in code is symbols. ASCII in programming is a set of safe symbols, not English.

In high-level languages, identifiers can be defined with any character, because there are IDEs and syntax highlighting.

But TOML is designed for ini-style files, which usually have no extra tool support when editing.
That's also why TOML is better than YAML (and the main reason TOML exists): in such files we can't easily indent or de-indent nested values or multiline strings.
In any other language, an indentation-based design is better than one without indentation; we all know that.

So I think it's really dangerous to allow bare keys to include non-ASCII characters.

If we support a complex character range based on Unicode semantics, like JS (and HTML) does, then users of every language have to bear the mental burden of all the other languages that they don't use and are unfamiliar with. It's not minimal.
If we support all non-ASCII characters in identifiers, like CSS does, then a lot of invisible characters could end up in a key, which is not obvious. As ini files are usually used for low-level system program configuration, that is very dangerous.

But I think it would be good for the spec to allow implementations to support bare keys in user-specified languages. You use whichever languages you are familiar with. For example, /[\u4e00-\u9fa5]/ is Chinese, so it can be easily supported, which makes bare keys easy to write and safe. But who else would know or care about that? As a user of that specific language, I know, and I can pass the range as an option value to the parser while keeping everything highly controlled.

I think simplicity and nationality need not conflict; it doesn't have to be one or the other. Otherwise, absolute "fairness" will lead to widespread inefficiency.
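The user-supplied range option suggested in this comment could look roughly like the sketch below. It is not taken from any existing parser: `make_bare_key_checker` and its `extra_ranges` parameter are hypothetical names, and the ASCII baseline is assumed to be the letters, digits, underscore and dash that bare keys already allow.

```python
import string

# Baseline assumed here: ASCII letters, digits, underscore and dash.
ASCII_BARE = set(string.ascii_letters + string.digits + "_-")


def make_bare_key_checker(extra_ranges=()):
    """Build a checker; `extra_ranges` holds inclusive (low, high) code point pairs."""
    ranges = tuple(extra_ranges)

    def is_bare_key(key):
        if not key:
            return False
        for ch in key:
            if ch in ASCII_BARE:
                continue
            cp = ord(ch)
            # Accept the character only if it falls in a user-supplied range.
            if not any(low <= cp <= high for low, high in ranges):
                return False
        return True

    return is_bare_key


ascii_only = make_bare_key_checker()
with_cjk = make_bare_key_checker([(0x4E00, 0x9FA5)])  # the range from the comment
```

With these two checkers, `with_cjk("名字")` accepts the key while `ascii_only("名字")` rejects it. The trade-off is that the same file may then be valid for one parser configuration and invalid for another.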

Mark Gillard · Answer 37 · Sun Apr 12 2020 05:56:35 GMT+0800 (China Standard Time)

@LongTengDao it's a config file format, not a database specification or real-time streaming format; I don't think 'efficiency' is all that relevant (if you mean the computational complexity of parsing, that is).

Unless you mean the efficiency of the actual implementing of the new functionality? As in, it will be a bit complex for implementers and maintainers to get this working in their parsers, thus being inefficient for them? If so, that's not even true. It's pretty easy to implement. I've done it myself, and provide relevant information in the original post.

I'm not sure what other sorts of efficiencies you could mean. It wouldn't make TOML any less efficient to write (if anything it would get simpler and easier to use as a result of this proposal).

龙腾道 · Answer 38 · Sun Apr 12 2020 07:12:10 GMT+0800 (China Standard Time)

@marzer I've never considered the difficulty of writing a parser to be a hindrance, and it's not worth considering in the face of the task of designing a well-formed file format. If anyone objects to this, I will be on your side.

I only mean the efficiency of writing and checking. Introducing special characters too broadly will make the process of reading and writing a file stressful again. Remember, Unicode doesn't just include characters from common languages like the ones you and I use (1en or 1em width). Instead, there are many combinations of displayed, invisible, and even right-to-left characters that are in the character categories rather than the punctuation or whitespace categories. It's a nightmare, as you'll know if you've ever developed typesetting software like Office Word. Since then, I have been frightened by phrases like "all valid Unicode".

But anyway, it doesn't affect my use. If the spec says ASCII only, I will provide a user option to allow any Unicode character range. If the spec says any Unicode character is valid, as you suggest, I will provide a user option to restrict keys to ASCII only. I think this right belongs to the user.

Mark Gillard · Answer 39 · Sun Apr 12 2020 15:24:27 GMT+0800 (China Standard Time)

If the spec says any Unicode character is valid, as you suggest

@LongTengDao to be clear, my proposal isn't to support "any Unicode character", as you seem to think. It's to support a subset (letters, numbers, and some combining marks).

Yeah there might be characters in those categories that are effectively garbage for our purposes but they can probably just be ignored; if it's not a character on a keyboard then someone has gone to effort to put it in their config, and if that breaks stuff then that's the life they chose. Parser library users can trivially add additional sanity-checking if they feel the need.
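For reference, the subset described here (using the category lists from the original post) can be sketched with Python's standard `unicodedata` module. This is an illustrative assumption of how such a check might look, not the code of any actual parser; the function name `is_bare_key` is hypothetical, and the results depend on the Unicode version bundled with the Python build.

```python
import unicodedata

# General categories from the original post.
ANYWHERE = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nd", "Nl"}  # letters and numbers
NOT_FIRST = {"Mc", "Mn"}                               # combining marks


def is_bare_key(key: str) -> bool:
    """Check a key against the category-based proposal (sketch only)."""
    if not key:
        return False
    for index, ch in enumerate(key):
        if ch in "_-":  # underscore and dash are already allowed in bare keys
            continue
        category = unicodedata.category(ch)
        if category in ANYWHERE:
            continue
        if category in NOT_FIRST and index > 0:
            continue
        return False
    return True
```

A key such as `"e\u0301"` (a letter followed by a combining acute accent) passes, while the same mark at the start of the key is rejected. Any additional sanity-checking of the kind mentioned above would be layered on top of this.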

Abel Braaksma · Answer 40 · Thu Jul 02 2020 23:16:00 GMT+0800 (China Standard Time)

Instead, there are many combinations of displayed, invisible, and even right-to-left characters that are in the character categories rather than the punctuation or whitespace categories.

@LongTengDao I believe the opposite to be true. Limiting users that are accustomed to right-to-left writing means limiting over 1.7 billion people worldwide to a system that is not native to them. What is perhaps perceived by you as "a nightmare" is perceived by others as a nightmare if it isn't allowed. Not everyone speaks English or can write in their native tongue using only ASCII characters (in fact, it is a relatively small share of the world population).

Inclusion of other cultures, languages and writing systems is a good thing, and although TOML is not a programming language, many well-known programming languages embrace inclusion more than exclusion: C#, VB, F# (allows any character), Java (they allow a broader set than defined here), Ruby, Perl, XML/HTML tag names, CSS classes/ids and there are many more.

Unicode even has a specific TR that describes the recommended way for allowing Unicode characters in identifiers: https://unicode.org/reports/tr31/.

Differences between languages will always exist, but the closer a language (or a spec like TOML) gets to TR31, the better it is for the worldwide community of speakers of thousands of languages, who can then write in their native tongue.

If any company or individual wishes to limit the allowed set of characters in identifiers, or in coding in general, they are of course free to do so, just like coding styles exist for many programming languages, you could limit your style to "only ASCII" or whatever you prefer.

And as has already been said, the proposal here is a safe subset of Unicode.

Abel Braaksma · Answer 41 · Thu Jul 02 2020 23:29:46 GMT+0800 (China Standard Time)

Regarding ranges full of marks, you mention that "they are not allowed as start character in XML and we should probably disallow it as well." Maybe you could investigate that further? It would be interesting to see which ranges are prohibited in NameStartChar, but allowed in NameChar – and why. If they are full of non-ASCII digits, we might want to allow them even at the start, but if they are full of marks, we certainly don't.

@ChristianSi, apologies for the wait, I forgot about your question here.

The precise definition of NameChar is that of NameStartChar with a few additions. These additions are therefore not allowed as a starting character:

NameChar ::=  NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

Let's split that up. According to this, the following are not allowed as a starting character (and I believe this mostly follows current practice in TOML as well):

The minus sign - and the period .
Any ASCII digit 0-9
Middle dot xB7, or ·
x300 - x36F, these are the combining diacritical marks, which means you cannot start with a combining acute or circumflex for instance, see https://en.wikipedia.org/wiki/Combining_Diacritical_Marks. I think this makes sense.
The "tie", x203F and x2040, or ‿ (under tie) ⁀ (character tie), https://en.wikipedia.org/wiki/Tie_(typography)

I'm not sure why the "tie" is forbidden as starting character in XML names (it is not a combining tie, it is spaced), but the other ones seem sensible.
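To make the split concrete, here is a small sketch of the two character classes as plain range tables. The NameChar additions are the ones listed above; the NameStartChar ranges are copied from the XML 1.0 (Fifth Edition) grammar rather than from this thread. The function names are hypothetical, and a TOML variant would presumably drop at least "." (and probably ":") from these sets.

```python
# NameStartChar ranges as given in the XML 1.0 (Fifth Edition) grammar.
NAME_START_RANGES = [
    (0x3A, 0x3A),  # ":"
    (0x41, 0x5A),  # A-Z
    (0x5F, 0x5F),  # "_"
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Additions that NameChar allows on top of NameStartChar.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2D),      # "-"
    (0x2E, 0x2E),      # "."
    (0x30, 0x39),      # ASCII digits
    (0xB7, 0xB7),      # middle dot
    (0x300, 0x36F),    # combining diacritical marks
    (0x203F, 0x2040),  # undertie and character tie
]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_xml_name(text):
    """True if `text` is one NameStartChar followed by zero or more NameChar."""
    return (
        bool(text)
        and is_name_start_char(text[0])
        and all(is_name_char(ch) for ch in text[1:])
    )
```

The whole check is a handful of integer comparisons per character, with no dependency on Unicode category data.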

I can write up a TOML spec proposal for this set, and/or extend it to TR31 if that somehow makes sense, but I think it is easier for people in general to use the XML specification (without reference to XML, of course, as it is otherwise unrelated), since they already did the necessary research, it's concise, and it's trivial to implement. TR31 is quite hard to read and probably raises new questions again.

Pradyun Gedam · Answer 42 · Sat Mar 05 2022 20:58:29 GMT+0800 (China Standard Time)

This looks like we're missing a PR for doing this. If someone wants to pick this up, and file a PR expanding the allowed bare keys syntax to include letters from the broader unicode spec, that'd be welcome!

Abel Braaksma · Answer 43 · Tue Mar 15 2022 20:22:00 GMT+0800 (China Standard Time)

@pradyunsg, done, I've created a PR in #891. I tried to be as inclusive as possible while maintaining simplicity for parsers. Basically the rule is now: "Any Unicode letter, letterlike character or digit, except dot", as discussed above.

Abel Braaksma · Answer 44 · Mon Sep 12 2022 22:19:58 GMT+0800 (China Standard Time)

This has now been merged. Thanks everyone for their support and insights!