k-takata / Onigmo

Onigmo is a regular expressions library forked from Oniguruma.

Support Unicode script extensions

747 opened this issue

Is there any plan to support the Script_Extensions (scx) property, which lets a character belong to more than one script?
It is already available in many dynamic languages such as Perl, PHP, Python, and (recently) JavaScript, and it would make script matching much more useful on real-world text.

For example, in JS after ES2018:

// match by script (= Ruby /[\p{Hani}\p{Hira}\p{Kana}]+/)
"ア行〜タ行のデータ".match(/[\p{sc=Hani}\p{sc=Hira}\p{sc=Kana}]+/gu);
// => [ "ア行", "タ行のデ", "タ" ]

// match by script_extensions
"ア行〜タ行のデータ".match(/[\p{scx=Hani}\p{scx=Hira}\p{scx=Kana}]+/gu);
// => [ "ア行〜タ行のデータ" ]
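To see why the two results differ, here is a minimal sketch that lists the characters reachable only through script extensions. The helper name `onlyViaExtensions` is made up for illustration, and the sketch assumes an engine with ES2018 Unicode property escapes.

```javascript
// Hypothetical helper (not part of any library): return the code points that
// match the scx classes but not the plain sc classes.
const bySc = /[\p{sc=Hani}\p{sc=Hira}\p{sc=Kana}]/u;
const byScx = /[\p{scx=Hani}\p{scx=Hira}\p{scx=Kana}]/u;

function onlyViaExtensions(text) {
  // Array.from iterates by code point, so astral characters stay intact.
  return Array.from(text).filter((ch) => byScx.test(ch) && !bySc.test(ch));
}

console.log(onlyViaExtensions("ア行〜タ行のデータ"));
// => [ "〜", "ー" ]
```

Both characters have Script=Common, so the sc-based pattern breaks the match at each of them.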

While script extensions are not a silver bullet, given the complications of Unicode, they would catch most of the common pitfalls in Unicode script matching. Manually reproducing the equivalent of an scx property with the plain script property often results in a non-trivial expression.

# implement \p{scx=Hira} equivalent
/[\p{Hira}、-〃〈-】〓-〟〰-〵〷〼〽\u3099-゜゠・ー﹅﹆｡-･ｰﾞﾟ]/
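A hand-written class like this can be checked against an engine that already supports scx. The sketch below transcribes the expression into JavaScript with explicit escapes, assuming the trailing ranges are the halfwidth forms U+FF61 to U+FF65, U+FF70, and U+FF9E to U+FF9F. `findMismatches` is a hypothetical helper, and its result depends on the Unicode version bundled with the engine.

```javascript
// Hand-written stand-in for \p{scx=Hira}: the base script plus the extra
// ranges transcribed from the expression above (halfwidth forms assumed).
const manualHira = /[\p{sc=Hira}\u3001-\u3003\u3008-\u3011\u3013-\u301F\u3030-\u3035\u3037\u303C\u303D\u3099-\u309C\u30A0\u30FB\u30FC\uFE45\uFE46\uFF61-\uFF65\uFF70\uFF9E\uFF9F]/u;
const builtinHira = /\p{scx=Hira}/u;

// Hypothetical helper: list code points where the two classes disagree
// under the Unicode version shipped with the running engine.
function findMismatches() {
  const mismatches = [];
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    if (cp >= 0xd800 && cp <= 0xdfff) continue; // skip lone surrogates
    const ch = String.fromCodePoint(cp);
    if (manualHira.test(ch) !== builtinHira.test(ch)) {
      mismatches.push("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
    }
  }
  return mismatches;
}

console.log(findMismatches());
```

Any code point it reports is one where the manual list has drifted from the engine's Unicode data. That maintenance burden is the one a built-in scx property would remove.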

Sorry if this has already been discussed somewhere; I couldn't find a relevant issue in this repository.