LHNCBC / ucum-lhc

LHC implementation of UCUM validation and conversion services

Proposal: Annotation strings should be allowed to contain Chinese characters and the like

linforestzhang opened this issue

In the file unitString.js:
// A regular expression for validating annotation strings.
static VALID_ANNOTATION_REGEX = /^\{[!-z|~]*\}$/;

My Proposal:
Annotation strings should be allowed to contain Chinese characters and the like, such as:
{片} ... for {tablet}

Proposed regular expression:
// A regular expression for validating annotation strings.
static VALID_ANNOTATION_REGEX = /^\{[!-z|~\u4e00-\u9fa5]*\}$/;

Validated examples:
{片} ... for {tablet}
{肌酐} for {creat} [ Usage: nmol/mmol{肌酐} ... for nmol/mmol{creat} ]
{蛋白质} for {prot} [ Usage: nmol/mg{蛋白质} ... for nmol/mg{prot} ]
...
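
To make the difference between the two patterns concrete, here is a minimal sketch that runs both against the examples above. The two regular expressions are copied from this issue; `compareAnnotation` is a hypothetical helper written for illustration and is not part of ucum-lhc. Both patterns match an annotation on its own, not a full unit string such as nmol/mmol{肌酐}.

```javascript
// Current pattern as quoted from unitString.js, and the pattern proposed in this issue.
const CURRENT_ANNOTATION_REGEX = /^\{[!-z|~]*\}$/;
const PROPOSED_ANNOTATION_REGEX = /^\{[!-z|~\u4e00-\u9fa5]*\}$/;

// Hypothetical helper (not part of ucum-lhc): test one annotation against both patterns.
function compareAnnotation(annotation) {
  return {
    annotation,
    current: CURRENT_ANNOTATION_REGEX.test(annotation),
    proposed: PROPOSED_ANNOTATION_REGEX.test(annotation),
  };
}

// ASCII annotations pass both patterns; the Chinese ones pass only the proposed one.
// '{a b}' fails both because the space character (code 32) is outside either class.
for (const sample of ['{tablet}', '{片}', '{肌酐}', '{蛋白质}', '{a b}']) {
  console.log(compareAnnotation(sample));
}
```

Note that the added range \u4e00-\u9fa5 covers only part of the CJK Unified Ideographs block, so other scripts and less common Chinese characters would still be rejected.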

This may really be a proposal for the UCUM specification itself.

I agree that this needs to be clarified in the UCUM specification. The above regular expression comes from the diagram at https://github.com/ucum-org/ucum/blob/main/assets/images/ucum-state-automaton.gif, which is supposed to appear as "Figure 1" under the BNF 2.2 (section 10) of https://ucum.org/ucum. I recommend opening an issue at https://github.com/ucum-org/ucum/issues. It is odd that the regular expression appears only in a GIF, so I am not sure whether they really mean to limit annotations to just those characters. If your issue gets approved there, please reopen this issue here.

@plynchnlm From the specification (Version: 2.1):

§6 curly braces: The full range of characters 33–126 can be used within a pair of curly braces (‘{’ and ‘}’). The material enclosed in curly braces is called annotation.
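
As a quick check of how the quoted range relates to the character class in unitString.js, the sketch below walks the codes 33 to 126 and collects those the class rejects. The class `[!-z|~]` is taken from the regular expression quoted at the top of this issue; the variable names are made up for illustration.

```javascript
// Single-character version of the class used in VALID_ANNOTATION_REGEX.
const ANNOTATION_CHAR = /^[!-z|~]$/;

// Collect every character in the range 33-126 that the class does not accept.
const rejected = [];
for (let code = 33; code <= 126; code++) {
  const ch = String.fromCharCode(code);
  if (!ANNOTATION_CHAR.test(ch)) {
    rejected.push(ch);
  }
}

// Only the two delimiters, '{' (123) and '}' (125), are left out.
console.log(rejected);
```

So the implementation accepts the range the specification names, minus the curly braces that delimit the annotation.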

Changing UCUM codes from 7-bit ASCII to UTF-8 would be so disruptive that I personally cannot imagine it ever happening under the UCUM name. Maybe in a successor with a new name.

@dalito Can you please elaborate on why UTF-8 would be disruptive?

@timbrisc - I was thinking about all the legacy code that has been written with the guarantee that UCUM codes are strictly ASCII. Its authors may have sized database fields accordingly, validated input based on that assumption, and written file-handling code without thinking about Unicode. In general, handling Unicode requires more care than has been needed so far.
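
A small sketch of the kind of assumption that would break: an ASCII-only validator rejects the new codes, and a field sized by character count no longer matches the stored UTF-8 byte count. Both helper functions are hypothetical and are not part of ucum-lhc; `TextEncoder` is the standard Web API available in browsers and Node.js.

```javascript
// Hypothetical helpers for illustration; neither is part of ucum-lhc.

// True when every character is in the printable 7-bit ASCII range 33-126.
function isPrintableAscii(code) {
  return /^[\x21-\x7e]*$/.test(code);
}

// Number of bytes the string occupies when encoded as UTF-8.
function utf8ByteLength(code) {
  return new TextEncoder().encode(code).length;
}

for (const code of ['nmol/mmol{creat}', 'nmol/mmol{肌酐}']) {
  console.log(code, {
    printableAscii: isPrintableAscii(code),
    utf16Units: code.length,          // what String.length reports
    utf8Bytes: utf8ByteLength(code),  // what a byte-sized field must hold
  });
}
```

For the ASCII code, the character count and byte count agree. For the code with a Chinese annotation, each ideograph takes three bytes in UTF-8, so a column or buffer sized for the character count would be too small.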

Nevertheless, allowing Unicode in annotations would be great. If UCUM stays ASCII forever without a path forward, it probably does not have a great future. I just think it must be made very clear that a Unicode UCUM is a new, extended version of the standard that breaks earlier guarantees.