Proposal: Annotation strings should be allowed to contain Chinese characters and the like

Question

linforestzhang opened this issue 7 months ago

In the file unitString.js:
// A regular expression for validating annotation strings.
static VALID_ANNOTATION_REGEX = /^\{[!-z|~]*\}$/;

My proposal:
Annotation strings should be allowed to contain Chinese characters and the like, such as:
{片} ... for {tablet}

Proposed regular expression:
// A regular expression for validating annotation strings.
static VALID_ANNOTATION_REGEX = /^\{[!-z|~\u4e00-\u9fa5]*\}$/;

Validated examples:
{片} ... for {tablet}
{肌酐} ... for {creat} [ Usage: nmol/mmol{肌酐} ... for nmol/mmol{creat} ]
{蛋白质} ... for {prot} [ Usage: nmol/mg{蛋白质} ... for nmol/mg{prot} ]
...

This may really be a proposal for the UCUM specification itself.
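To make the difference concrete, here is a minimal sketch that runs the current and the proposed pattern over a few annotations. The two regular expressions are copied from the post above; the constant names, the sample list, and the `{a b}` case are illustrative assumptions, not code from unitString.js.

```javascript
// Current pattern, as quoted from unitString.js in the post above.
const CURRENT_ANNOTATION_REGEX = /^\{[!-z|~]*\}$/;
// Proposed pattern from this issue: additionally allows U+4E00..U+9FA5.
const PROPOSED_ANNOTATION_REGEX = /^\{[!-z|~\u4e00-\u9fa5]*\}$/;

// Illustrative samples (the last one contains a space, which neither
// pattern accepts because code 32 is outside both character classes).
const samples = ['{tablet}', '{creat}', '{片}', '{肌酐}', '{蛋白质}', '{a b}'];

const results = samples.map((annotation) => ({
  annotation,
  current: CURRENT_ANNOTATION_REGEX.test(annotation),
  proposed: PROPOSED_ANNOTATION_REGEX.test(annotation),
}));

console.table(results);
```

The ASCII annotations pass both patterns, the Chinese ones pass only the proposed pattern, and `{a b}` fails both. Note that U+4E00 to U+9FA5 covers only part of the CJK Unified Ideographs block (which runs to U+9FFF) and no other scripts, so "and the like" would need a wider class.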

Paul Lynch · Answer 1 · Sat Jan 06 2024 00:45:52 GMT+0800 (China Standard Time)

I agree that this needs to be clarified in the UCUM specification. The regular expression above comes from the diagram at https://github.com/ucum-org/ucum/blob/main/assets/images/ucum-state-automaton.gif, which is supposed to be shown as "Figure 1" under the BNF 2.2 (section 10) of https://ucum.org/ucum. I recommend opening an issue at https://github.com/ucum-org/ucum/issues. It is odd that the regular expression appears only in a GIF, so I am not sure whether they really mean to limit annotations to just those characters. If your issue is approved there, please reopen this issue here.

David Linke · Answer 2 · Sun Jan 07 2024 00:36:36 GMT+0800 (China Standard Time)

@plynchnlm From the specification (Version: 2.1):

§6 curly braces: The full range of characters 33–126 can be used within a pair of curly braces (‘{’ and ‘}’). The material enclosed in curly braces is called annotation.
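As a quick cross-check of that range against the character class used in unitString.js, the sketch below walks codes 33 to 126 and collects the ones the class rejects. The character class is taken from the regular expression quoted in the original post; the variable names are illustrative.

```javascript
// Single-character version of the class inside VALID_ANNOTATION_REGEX.
const ANNOTATION_BODY_CHAR = /^[!-z|~]$/;

// Collect every character in the 33..126 range that the class does not accept.
const rejected = [];
for (let code = 33; code <= 126; code++) {
  const ch = String.fromCharCode(code);
  if (!ANNOTATION_BODY_CHAR.test(ch)) {
    rejected.push(ch);
  }
}

console.log(rejected); // [ '{', '}' ]
```

Only the two brace characters themselves are excluded, so the implemented class is the 33–126 range minus the delimiters.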

Changing UCUM codes from 7-bit ASCII to UTF-8 would be so disruptive that I personally cannot imagine it to ever happen under the UCUM name. Maybe in a successor with a new name.

Tim Briscoe · Answer 3 · Sun Jan 07 2024 10:10:14 GMT+0800 (China Standard Time)

@dalito Can you please elaborate on why UTF-8 would be disruptive?

David Linke · Answer 4 · Sun Jan 07 2024 19:05:56 GMT+0800 (China Standard Time)

@timbrisc - I was thinking about all the legacy code that has been written with the guarantee that UCUM codes are strictly ASCII. Its authors may have chosen database field types accordingly, validated input based on that assumption, and written file-handling code without thinking about Unicode. In general, handling Unicode requires more care, which has not been needed so far.

Nevertheless, allowing Unicode in annotations would be great. If UCUM stays ASCII forever without a path forward, it probably does not have a great future. I just think it must be made very clear that a Unicode UCUM is a new, extended version of the standard that breaks earlier guarantees.
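One way to picture the legacy-code concern is the gap between character count and byte count once non-ASCII characters appear. This is a minimal sketch, assuming a runtime that provides the standard `TextEncoder` global (current browsers and Node.js); the `describe` helper and its field names are hypothetical.

```javascript
const encoder = new TextEncoder(); // always encodes to UTF-8

// Hypothetical helper: report the sizes a legacy system might care about.
function describe(code) {
  return {
    code,
    codeUnits: code.length,                 // UTF-16 code units in the JS string
    utf8Bytes: encoder.encode(code).length, // bytes when stored as UTF-8
    isAscii: /^[\x00-\x7f]*$/.test(code),   // the guarantee older code relies on
  };
}

const ascii = describe('nmol/mmol{creat}');
const cjk = describe('nmol/mmol{肌酐}');

console.log(ascii);
console.log(cjk);
```

For the ASCII code the two sizes agree (16 and 16), while the annotated Chinese code is 13 code units but 17 UTF-8 bytes, so a column or buffer sized by character count under a one-byte-per-character assumption could truncate it.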