BSBMteam / zhconv-rs

Convert Chinese text among Trad/Simp and regional variants | 中文简繁及地區詞轉換

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

docs.rs Crates.io CI status

zhconv-rs 中文简繁及地區詞轉換

zhconv-rs converts Chinese text among several scripts or regional variants (e.g. zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant), built on the top of zhConversion.php conversion tables from Mediawiki, which is the one also used on Chinese Wikipedia.

Web App: https://zhconv.pages.dev/ (powered by WASM)

Cli: cargo install zhconv-cli or check releases(TODO).

Crate:

[dependencies]
zhconv = "0.1"

Supported variants

Target Tag Script Description
Simplified Chinese / 简体中文 zh-Hans SC / 简 W/O substituing region-specific phrases.
Traditional Chinese / 繁體中文 zh-Hant TC / 繁 W/O substituing region-specific phrases.
Chinese (Taiwan) / 臺灣正體 zh-TW TC / 繁 With Taiwan-specific phrases adapted.
Chinese (Hong Kong) / 香港繁體 zh-HK TC / 繁 With Hong Kong-specific phrases adapted.
Chinese (Macau) / 澳门繁體 zh-MO TC / 繁 Same as zh-HK for now.
Chinese (Mainland China) / 大陆简体 zh-CN SC / 简 With mainland China-specific phrases adapted.
Chinese (Singapore) / 新加坡简体 zh-SG SC / 简 Same as zh-CN for now.
Chinese (Malaysia) / 大马简体 zh-MY SC / 简 Same as zh-CN for now.

Note: zh-TW and zh-HK are based on zh-Hant. zh-CN are based on zh-Hans. Currently, zh-MO shares the same conversion table with zh-HK unless additonal rules / CGroups are applied; zh-MY and zh-SG shares the same conversion table withzh-CN unless additional rules / CGroups are applied.

Performance

cargo bench on Intel(R) Xeon(R) CPU @ 2.80GHz (GitPod), without parsing inline conversion rules:

load zh2Hant            time:   [45.442 ms 45.946 ms 46.459 ms]
load zh2Hans            time:   [8.1378 ms 8.3787 ms 8.6414 ms]
load zh2TW              time:   [60.209 ms 61.261 ms 62.407 ms]
load zh2HK              time:   [89.457 ms 90.847 ms 92.297 ms]
load zh2MO              time:   [96.670 ms 98.063 ms 99.586 ms]
load zh2CN              time:   [27.850 ms 28.520 ms 29.240 ms]
load zh2SG              time:   [28.175 ms 28.963 ms 29.796 ms]
load zh2MY              time:   [27.142 ms 27.635 ms 28.143 ms]
zh2TW data54k           time:   [546.10 us 553.14 us 561.24 us]
zh2CN data54k           time:   [504.34 us 511.22 us 518.59 us]
zh2Hant data689k        time:   [3.4375 ms 3.5182 ms 3.6013 ms]
zh2TW data689k          time:   [3.6062 ms 3.6784 ms 3.7545 ms]
zh2Hant data3185k       time:   [62.457 ms 64.257 ms 66.099 ms]
zh2TW data3185k         time:   [60.217 ms 61.348 ms 62.556 ms]
zh2TW data55m           time:   [1.0773 s 1.0872 s 1.0976 s]

Differences between other tools

  • ZhConver{sion,ter}.php of MediaWiki: zhconv-rs are just based on conversion tables listed in ZhConversion.php. MediaWiki relies on the inefficient PHP built-in function strtr. Under the basic mode, zhconv-rs guarantees linear time complexity with single-pass scanning of input text. Optionally, zhconv-rs supports the same conversion rule syntax with MediaWiki.
  • OpenCC: OpenCC has self-maintained conversion tables that are different from MediaWiki. The converter implementation of OpenCC is kinda similar to the aforementioned strtr. zhconv-rs uses the Aho-Corasick algorithm, which would be much faster in general.

All of these implementation shares the same leftmost-longest matching strategy. So conversion results should generally be the same given the same conversion tables.

Credits

All data that powers the converter, including conversion tables and CGroups, comes from the MediaWiki project.

The project takes the following projects/pages as references:

TODO

  • Support Module:CGroup
  • Propogate error properly with Anyhow and thiserror

About

Convert Chinese text among Trad/Simp and regional variants | 中文简繁及地區詞轉換

License:GNU General Public License v3.0


Languages

Language:Rust 55.4%Language:TypeScript 28.9%Language:Python 11.9%Language:HTML 1.8%Language:JavaScript 1.2%Language:CSS 0.8%