whitequark / ocaml-m17n

Multilingualization for the OCaml source code

Multilingualization for OCaml source code

The m17n package allows using Unicode identifiers in the OCaml source code.

type 色 =
| @赤
| @黄色
| @緑色
[@@deriving show]

let () = print_endline (show_色 @赤)

You're encouraged to use m17n.

Installation

m17n can be installed via OPAM:

opam install m17n

Note that m17n is not compatible with camlp4.

Usage

m17n can be activated using ocamlfind:

ocamlfind ocamlc -package m17n -syntax utf8 ...

If you are using ocamlbuild, add the following to your _tags file:

<**/*.{ml,mli}>: package(m17n), syntax(utf8)

m17n also works in the standard toplevel and in utop. It can be activated using:

#require "m17n";;

Language-specific packages may be used instead of the m17n package. These packages localize the OCaml keywords; the English keywords remain available.

The following localization packages are available:

  • m17n.zh_CN

Features

m17n expects the source code to be valid UTF-8. It extends the identifiers normally recognized by OCaml to include all Unicode letters and digits. The case distinction is preserved; however, the names of files corresponding to modules with non-English names must start with an uppercase character.

Since few of the world's scripts distinguish between upper and lower case, a sigil is provided to distinguish constructor and module names from all other identifiers. When an identifier is prefixed with @, it is treated as if its first letter were uppercase.

See technical details for specifics.

m17n is compatible with ppx syntax extensions such as ppx_deriving.

m17n includes integration both with the standard OCaml toplevel and utop.

m17n does not add Unicode literals or any runtime support for manipulating Unicode strings and characters. (See OCaml pull request #80 and the Uutf, Uunf and Uucd projects.)

Can't look-alike characters like a and а be confusing?

They can.

However, m17n issues a warning if more than one script is used in an identifier, hopefully handling most of the confusing cases. You can still use several scripts if you separate them by underscores, e.g. show_色.

Additionally, m17n issues a warning if any two identifiers look alike enough to be visually confusable.

Localized error messages?

This will be possible when PR6696 is fixed.

Are RTL scripts supported?

In theory, yes. However, I lack the ability to verify whether RTL support works correctly. Open an issue if it does not.

Technical details

Unicode handling

m17n includes only five changes to the OCaml lexer:

  • U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000D CARRIAGE RETURN (CR), U+000C FORM FEED (FF), U+0020 SPACE and U+3000 IDEOGRAPHIC SPACE are recognized as whitespace,
  • Characters of General Category Lu are recognized as uppercase letters at the start of an identifier,
  • Characters of General Categories Ll, Lm, Lo, Lt and U+005F LOW LINE are recognized as lowercase letters at the start of an identifier,
  • Characters with property ID_Continue are recognized as continuation of an identifier,
  • U+0040 COMMERCIAL AT makes the following lowercase or unicase letter recognized as an uppercase letter. If it is possible to map the letter to upper case, this is done.

To summarize, an identifier may start with ID_Start, and continue with ID_Continue.
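A rough sketch of two of these rules, using only the standard library (OCaml 4.14 or later for String.get_utf_8_uchar). This is not m17n's actual lexer: the names is_whitespace, is_upper_start and classify are hypothetical, ASCII A-Z stands in for General Category Lu, and the upper case mapping is ASCII-only.

```ocaml
(* The six code points listed above as whitespace. *)
let is_whitespace (u : Uchar.t) : bool =
  match Uchar.to_int u with
  | 0x0009 | 0x000A | 0x000C | 0x000D | 0x0020 | 0x3000 -> true
  | _ -> false

type ident_class = Uppercase | Lowercase

(* ASSUMPTION: only ASCII A-Z stands in for General Category Lu here;
   a real implementation would consult the Unicode Character Database. *)
let is_upper_start (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  c >= 0x41 && c <= 0x5A

(* Classify an identifier given as a UTF-8 string. A leading '@' sigil
   forces the uppercase class and is stripped from the returned name.
   ASSUMPTION: the upper case mapping is only attempted for ASCII letters. *)
let classify (s : string) : ident_class * string =
  if s = "" then invalid_arg "classify: empty identifier"
  else if s.[0] = '@' then
    (Uppercase, String.capitalize_ascii (String.sub s 1 (String.length s - 1)))
  else
    let first = Uchar.utf_decode_uchar (String.get_utf_8_uchar s 0) in
    ((if is_upper_start first then Uppercase else Lowercase), s)
```

With these definitions, classify "@赤" yields the uppercase class with the name 赤 unchanged, since the letter has no upper case form, while classify "@foo" yields the name Foo.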

All identifiers are normalized to NFC. However, strings are not normalized.

These rules closely follow the recommendations of Unicode TR31. Formally, m17n conforms to Unicode 6.3 UAX31 Level 1, observing R1 and R3 with the profile specified above and R4 unconditionally with normalization to NFC.

Detecting confusable characters

m17n follows Unicode TR39 at Unicode 7.0.0 in its handling of confusable characters.

Within a single identifier chunk (a part of an identifier separated by U+005F LOW LINE), all characters should satisfy the Highly Restrictive mixed script detection.

Within a single source chunk, no two identifiers should be confusable according to the Mixed-Script Any-Case table.

If either condition is violated, m17n issues a warning.
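The per-chunk check can be illustrated with a deliberately simplified sketch. It assumes a toy script table covering a handful of ranges, and it flags any chunk that uses more than one known script, which is stricter than the Highly Restrictive level of TR39 (that level permits certain combinations such as Latin with Han). The names script_of, uchars, scripts_of_chunk and mixed_chunks are hypothetical, and OCaml 4.14 or later is assumed.

```ocaml
(* ASSUMPTION: a toy script table covering a few ranges only; the real
   data comes from the Unicode Scripts property. *)
type script = Latin | Cyrillic | Greek | Han | Other

let script_of (u : Uchar.t) : script =
  let c = Uchar.to_int u in
  if (c >= 0x41 && c <= 0x5A) || (c >= 0x61 && c <= 0x7A)
     || (c >= 0xC0 && c <= 0x24F) then Latin
  else if c >= 0x370 && c <= 0x3FF then Greek
  else if c >= 0x400 && c <= 0x4FF then Cyrillic
  else if c >= 0x4E00 && c <= 0x9FFF then Han
  else Other

(* Decode a UTF-8 string into code points. *)
let uchars (s : string) : Uchar.t list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  go 0 []

(* Distinct scripts used in one chunk; characters the toy table does not
   know (digits, for instance) are ignored. *)
let scripts_of_chunk (chunk : string) : script list =
  uchars chunk
  |> List.map script_of
  |> List.filter (fun s -> s <> Other)
  |> List.sort_uniq compare

(* Chunks of an identifier, split on U+005F LOW LINE, that mix scripts. *)
let mixed_chunks (ident : string) : string list =
  String.split_on_char '_' ident
  |> List.filter (fun chunk -> List.length (scripts_of_chunk chunk) > 1)
```

Under this sketch, show_色 produces no mixed chunks because the underscore separates the Latin and Han parts, whereas an identifier spelling "paypal" with a Cyrillic а in place of the first Latin a is reported.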

Interaction with filesystem

OCaml uses module names to search the include path for referenced modules. As the module names are normalized to NFC, the queries to the filesystem use the same form. Different operating systems handle them in different ways:

  • Mac OS X on HFS+ stores the filenames in NFD and normalizes all input to NFD. No edge cases are possible.
  • Other *nix systems such as Linux treat filenames as opaque /-delimited, NUL-terminated streams of bytes. However, essentially all existing input methods normalize to NFC.
  • Windows treats filenames as opaque streams of UTF-16 characters with a somewhat more complex interpretation. It performs its own case folding (in most cases; case-sensitive Windows filesystems exist), but no normalization. Its input methods normalize to NFC as well.

m17n aims to reduce possible confusion by searching the include directories for any OCaml build products whose basenames are identical to the name of a referenced module under toNFKC_Casefold (definition R5 in Unicode 6.3 section 3.13), but differ from it as written. This measure should be enough to catch not only all instances of mis-normalized filenames, but also incorrect capitalization and some look-alike characters.
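The shape of this check can be sketched as follows. The standard library has no toNFKC_Casefold, so ASCII lowercasing is used as a stand-in, which means this version only catches capitalization mistakes in ASCII names. The names fold and suspicious are hypothetical, and exempting the uncapitalized filename (which OCaml accepts for a module) is an assumption of the sketch, not something stated above.

```ocaml
(* ASSUMPTION: ASCII lowercasing stands in for toNFKC_Casefold, which needs
   Unicode data not available in the standard library. *)
let fold (s : string) : string = String.lowercase_ascii s

(* Build products whose basename matches [modname] after folding but not
   as written; these are the candidates for a warning.
   ASSUMPTION: the uncapitalized form of the module name is also accepted. *)
let suspicious (modname : string) (files : string list) : string list =
  List.filter
    (fun file ->
      let base = Filename.remove_extension (Filename.basename file) in
      base <> modname
      && base <> String.uncapitalize_ascii modname
      && fold base = fold modname)
    files
```

For a referenced module Foo and the files lib/foo.cmi, lib/FOO.cmi and lib/bar.cmi, only lib/FOO.cmi is returned.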

Interaction with the OCaml compiler

The OCaml compiler has a -pp option which, among other things, accepts a binary that reads source code and emits an OCaml abstract syntax tree, which makes it possible to implement a custom frontend. (This is what camlp4 uses.)

Additionally, the OCaml compiler exports its internals, including the parser, in the compiler-libs package, so a custom frontend does not need to reimplement the parser.

The compiler treats the identifiers as opaque tokens almost everywhere. It does not even concatenate them, which is important, as NFC is not generally closed under concatenation. The only place where the compiler actually dissects the strings is the module name → filename mapping. However, it ignores bytes with values over 127, passing UTF-8 strings through.
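To see why concatenation matters, consider a small standard-library example (the \u{...} string escapes need OCaml 4.06 or later). It does not perform normalization itself, which would need a library such as Uunf; it only compares the byte sequences involved. The variable names are illustrative.

```ocaml
(* "a" and a lone U+0301 COMBINING ACUTE ACCENT are each in NFC. *)
let base = "a"
let accent = "\u{0301}"

(* Their concatenation is the decomposed spelling of "á" ... *)
let concatenated = base ^ accent

(* ... whereas NFC requires the precomposed U+00E1. *)
let precomposed = "\u{00E1}"

let () =
  Printf.printf "concatenated: %d bytes, precomposed: %d bytes, equal: %b\n"
    (String.length concatenated)
    (String.length precomposed)
    (concatenated = precomposed)
```

The two strings render the same but differ byte for byte (3 bytes versus 2 in UTF-8), so a compiler that glued identifier fragments together could produce names that are no longer in NFC.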

Findlib provides an interface that allows registering a preprocessor. Additionally, it will pass all package include paths to such a preprocessor.

m17n uses all these features and implementation details to provide a seamless Unicode-aware frontend.

Acknowledgements

The zh_CN translation was contributed by Tao Stein.

License

MIT license
