open-i18n / rust-unic

UNIC: Unicode and Internationalization Crates for Rust

Home Page: https://crates.io/crates/unic

[unic-ucd-name] Macro for getting Unicode characters by name

clarfonthey opened this issue · comments

Essentially, something like:

const MULTIPLY_DOT: char = named_char!("dot operator");

I've found that when I want to use lesser-known Unicode characters, I often resort to looking up the escape codes and including them directly in my source. This is error-prone: a name that doesn't match is a lot easier to catch than an escape code that doesn't.
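To illustrate the failure mode, here is a minimal sketch. The `named_char!` below is a hypothetical hand-written `macro_rules!` stand-in with two hard-coded names, not the proposed implementation; a real version would draw on the full Unicode name data.

```rust
// Hypothetical stand-in for the proposed macro: a tiny table of literal
// names. Each rule matches one exact string literal.
macro_rules! named_char {
    ("dot operator") => { '\u{22C5}' };
    ("star operator") => { '\u{22C6}' };
}

// Escape code with a one-digit typo: this compiles without complaint,
// but it is STAR OPERATOR (U+22C6), not DOT OPERATOR (U+22C5).
const TYPO_DOT: char = '\u{22C6}';

// Name-based lookup, resolved at compile time.
const MULTIPLY_DOT: char = named_char!("dot operator");

// A misspelled name is rejected by the compiler, because no rule matches:
// const BAD: char = named_char!("dot operater");
```

The mistyped escape code is a valid character, so nothing flags it; the misspelled name has no matching rule and fails the build.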

Many systems, especially regex engines, allow escaping Unicode characters by name, in the form \N{dot operator} or \u{dot operator}. IIRC, there's already a GH issue filed for Rust for this.

In the meantime, I guess we can add this to unic-ucd-name, as an optional feature.

unicode_names provides this today.

As part of this Reddit thread, the author of said crate mentioned its compression algorithm. Given #199 and the already shaky compile times for unic/ucd/name, I've been considering basically bringing in unicode_names's table structure, and thus the bidirectionality. I think the reverse-direction data can be optional. Once that exists, it would be trivially possible to create a proc-macro that uses that crate.
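As a rough sketch of what "bidirectional, with the reverse direction optional" could look like: the three-entry table and the function names `name_of` and `char_named` are made up for illustration, and the table is uncompressed. This is not the actual unicode_names layout.

```rust
// Hypothetical three-entry table, sorted by code point. Real data would be
// generated from the Unicode Character Database and compressed.
static BY_CHAR: &[(char, &str)] = &[
    ('\u{00D7}', "MULTIPLICATION SIGN"),
    ('\u{22C5}', "DOT OPERATOR"),
    ('\u{22C6}', "STAR OPERATOR"),
];

// Reverse index: positions into BY_CHAR, sorted by name. This is the part
// that could sit behind an optional Cargo feature.
static BY_NAME: &[usize] = &[1, 0, 2];

/// char -> name: binary search on the code point.
pub fn name_of(c: char) -> Option<&'static str> {
    BY_CHAR
        .binary_search_by_key(&c, |&(ch, _)| ch)
        .ok()
        .map(|i| BY_CHAR[i].1)
}

/// name -> char: binary search through the name-sorted index.
pub fn char_named(name: &str) -> Option<char> {
    BY_NAME
        .binary_search_by(|&i| BY_CHAR[i].1.cmp(name))
        .ok()
        .map(|pos| BY_CHAR[BY_NAME[pos]].0)
}
```

A proc-macro could then call something like `char_named` at expansion time and emit the resulting char literal, or a compile error when the lookup returns `None`.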

I'm assigning this to myself, as a proper resolution to the lurking horror that is #199 will make this simple to add.

To add some additional context for others following this issue (see this Reddit thread for the impetus):

One possible way to provide the name->char mapping is with finite state transducers. The TL;DR is that an FST is a finite state machine that compactly represents sets of byte strings, and can also represent a simple bytes -> u64 mapping as well. For all Unicode names (even including generated Hangul/Ideograph names and all aliases), the total size on disk and in memory is 230KB. An FST is searchable in its compact form.

Currently, the only required dependency of fst is byteorder (it by default requires the memmap crate, but that can be disabled). It also currently relies on the standard library, but if there's interest, it should definitely be possible to provide a no_std mode (maybe a new fst-core crate, not sure) that can read FSTs but cannot write them.

A key downside of FST is that its compaction comes with slower access times when compared to, say, a standard trie. I suspect the difference isn't too large though, and there's probably room for more optimizations.
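For reference, the "standard trie" baseline can be sketched in a few lines. This is a plain, uncompressed trie mapping bytes -> u64, not the fst crate and not an FST: an FST would additionally share common suffixes between keys and store the outputs along transitions, which is where its compactness comes from. The `Trie` and `Node` types are hypothetical names for illustration.

```rust
use std::collections::BTreeMap;

/// One state of the trie; transitions are labeled with bytes.
#[derive(Default)]
struct Node {
    next: BTreeMap<u8, Node>,
    /// Set when a key ends in this state.
    value: Option<u64>,
}

/// Uncompressed bytes -> u64 map. Keys share prefixes but not suffixes.
#[derive(Default)]
pub struct Trie {
    root: Node,
}

impl Trie {
    /// Walk (and create as needed) one state per key byte, then store the value.
    pub fn insert(&mut self, key: &[u8], value: u64) {
        let mut node = &mut self.root;
        for &b in key {
            node = node.next.entry(b).or_default();
        }
        node.value = Some(value);
    }

    /// Follow one transition per key byte; fail as soon as one is missing.
    pub fn get(&self, key: &[u8]) -> Option<u64> {
        let mut node = &self.root;
        for b in key {
            node = node.next.get(b)?;
        }
        node.value
    }
}
```

For a name -> char lookup, the stored u64 would simply be the code point, e.g. inserting `b"DOT OPERATOR"` with value `0x22C5`.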

commented

any word on this?