ELLIOTTCABLE / ocaml-string-convert

Convert between JavaScript UCS-2-encoded strings and OCaml-friendly UTF-8 byte-arrays.

ocaml-string-convert

This is a library to shim BuckleScript's string-handling when using native-OCaml string-manipulation libraries.

Background

When using BuckleScript to compile OCaml source-code to JavaScript, no attempt is made to handle the runtime conversion of string values between the two semantic systems.

In particular, a String value in JavaScript is (basically) a UCS-2 character array. The closest you can get to "give me the thingie X thingies into the length of the string" is String::charCodeAt; specifically, this function returns the Nth UTF-16 code unit of the string.
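To make "Nth UTF-16 code unit" concrete, here's a small sketch using only standard String methods. The sample strings are my own picks, written as escapes so the file's encoding doesn't matter:

```typescript
// charCodeAt indexes 16-bit code units, not characters and not bytes.
const ascii = "h";
const arabic = "\u062C";        // "ج", U+062C: inside the BMP, so one code unit
const emoji = "\uD83D\uDE00";   // U+1F600: outside the BMP, so a surrogate pair

const asciiUnit: number = ascii.charCodeAt(0);    // 104
const arabicUnit: number = arabic.charCodeAt(0);  // 1580
const emojiUnits: number[] = [emoji.charCodeAt(0), emoji.charCodeAt(1)];

// One visible character, but .length counts code units, so this logs 2.
console.log(asciiUnit, arabicUnit, emoji.length, emojiUnits);
```

Nothing here ever yields a byte of UTF-8; that mismatch is the whole problem described next.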

Meanwhile, over in OCaml land, the type string (and the functions in the module String) is, semantically, a dumb byte array. That is, when you ask the OCaml compiler for a_string.[0], you don't get the first character of the string, or even a Unicode-aware codepoint or grapheme; instead, you get the first byte of (what OCaml believes to be) a series of opaque bytes.

Unfortunately, BuckleScript compiles the latter syntax (a_string.[0]) into the former semantics (a_string.charCodeAt(0)); the two agree only within the very limited range of the ASCII-compatible bytes; that is, values 0 through 127.

Let's experiment with the following small program. It'll take an input string on the command-line, extract the first ... character? byte? and then tell us about it.

(* str_test.ml *)
let first_char_info s =
   let c = s.[0] in
   "Code: " ^ string_of_int (Char.code c) |> print_endline;
   "String: " ^ String.make 1 c |> print_endline

(* Change the "1" to a "2" to execute this with Node.js. Annoyingly. *)
let () = first_char_info Sys.argv.(1)

The above works, both when compiled via the traditional OCaml toolchain, and when compiled to JavaScript and executed with Node.js ... but only when the entire string is within the ASCII range:

$ bsc str_test.ml
$ node str_test.js hello
Code: 104
String: h

$ ocaml str_test.ml hello
Code: 104
String: h

Let's try the same thing with a non-ASCII, international string:

$ node str_test.js جمل
Code: 1580
String: ج

$ ocaml str_test.ml جمل
Code: 216
String: ?

Ruh-roh. The problem here comes from this series of exchanges:

  1. The value s in the above program comes in as a UTF-8 encoded string; that's what the shell is passing along to the program in Sys.argv.

  2. Node.js understands and expects this, and converts the incoming value into its internal format, UCS-2; this means that s.charCodeAt(0) is going to be the first 16-bit code unit of that input string as encoded in UCS-2. That is to say, "ج", integer value 1580.

  3. An OCaml program, unaware that it's being compiled via BuckleScript, expects string values arising from UTF-8 input (like s) to be addressed bytewise; that is, it expects s.[0] to yield "\xD8" (216) and s.[1] to yield "\xAC" (172), the two bytes of the UTF-8 encoding of the codepoint ‘ج’.
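The mismatch in those three steps can be reproduced from the JavaScript side alone. This is only an illustrative sketch: it uses the standard TextEncoder global to stand in for "what native OCaml would see", and assumes a runtime (Node.js or a browser) that provides it:

```typescript
// "جمل" written as escapes: U+062C, U+0645, U+0644.
const sample = "\u062C\u0645\u0644";

// What BuckleScript-compiled code gets for s.[0]: a UTF-16 code unit.
const firstCodeUnit = sample.charCodeAt(0);      // 1580 (0x062C)

// What native OCaml gets for s.[0] and s.[1]: raw UTF-8 bytes.
const utf8 = new TextEncoder().encode(sample);
const firstByte = utf8[0];                       // 216 (0xD8)
const secondByte = utf8[1];                      // 172 (0xAC)

// Three code units on the JavaScript side, six bytes on the OCaml side.
console.log(firstCodeUnit, firstByte, secondByte, sample.length, utf8.length);
```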

tl;dr OCaml libraries expecting to operate on UTF-8 byte-arrays (like Sedlex, Menhir, Camomile, any of Daniel Bünzli's Unicode-handling libraries) are going to break when compiled to JavaScript via BuckleScript and fed actual UTF-8 input.

Solution

This library provides a shim for this behaviour. Unicode input to a JavaScript program can be fed through the functions provided by this library, which uses the TextEncoder and TextDecoder APIs (or the fast-text-encoding npm module as a shim therefor) to transform the UCS-2 strings being passed around by JavaScript systems, into TypedArrays of UTF-8 bytes. These UTF-8 values will then be copied back into (now malformed, but predictably-malformed) JavaScript Strings; these can be passed with impunity to UTF-8 handling OCaml functions, which will now function as expected.

Note: This package is not necessary for code written specifically for BuckleScript; just be aware of the BuckleScript-specific semantics of the .[] string-indexing operator. This package is only necessary if you're A. writing a library that's intended to be used by both native projects and JavaScript projects, or B. using a native-targeting library from opam and compiling it to JavaScript.

Usage

Install ocaml-string-convert with npm:

npm install --save ocaml-string-convert

Include it on the JavaScript side of your project:

import {
   toFakeUTF8String,
   fromFakeUTF8String
} from 'ocaml-string-convert'

toFakeUTF8String(str)

This function is intended to be called on JavaScript strings (possibly containing Unicode characters outside the ASCII range) that need to be passed to OCaml functions; it ‘double-encodes’ those strings such that they will be perceived by BuckleScript-compiled OCaml as UTF-8-encoded char-arrays.

Input

This function takes one argument, a ‘standard’ JavaScript String; that is, one with Unicode characters outside the ASCII range (but still within the BMP!) encoded as single, 16-bit code-units; and higher-plane characters encoded as UTF-16-style surrogate pairs.

  • Example, as a UCS-2 sequence of 16-bit code-units:

    [102, 111, 111, 183, 98, 97, 114]
  • Example, as typed into a UTF-8 JavaScript source-file:

    "foo·bar"

Output

An abomination. This produces a JavaScript String (that is still technically encoded as UCS-2, mind you!) containing a series of UTF-8 bytes, each interpreted as a UCS-2 codepoint.

  • Example, as a UCS-2 sequence of 16-bit code-units:

    [102, 111, 111, 194, 183, 98, 97, 114]
  • Example, as typed into a UTF-8 JavaScript source-file:

    "foo\xC2\xB7bar" // or "fooÂ·bar", if you're a heathen

Note that, in this example, the non-ASCII character U+00B7 “MIDDLE DOT”, which is one code-unit (literally \u00B7) in the original input-string, is encoded as two JavaScript / UCS-2 code-units, \xC2\xB7; C2-B7 being the UTF-8 encoding of U+00B7.
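For the curious, the ‘double-encoding’ step can be sketched in a few lines. To be clear, this is not the library's actual source: toFakeSketch and codeUnits are hypothetical names of mine, and the only thing assumed is the standard TextEncoder API:

```typescript
// Hypothetical stand-in for toFakeUTF8String; not the library's real code.
function toFakeSketch(str: string): string {
  const bytes = new TextEncoder().encode(str); // UTF-8 bytes of the input
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    // Each byte (0-255) becomes its own 16-bit code unit.
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// Helper (also hypothetical) to list a string's code units.
function codeUnits(s: string): number[] {
  const units: number[] = [];
  for (let i = 0; i < s.length; i++) units.push(s.charCodeAt(i));
  return units;
}

const fake = toFakeSketch("foo\u00B7bar");
console.log(codeUnits(fake)); // [102, 111, 111, 194, 183, 98, 97, 114]
```

The output matches the example above: seven code units in, eight out, with U+00B7 split into 194 and 183.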

fromFakeUTF8String(str)

The inverse operation to the above.

Given a double-encoded (effectively, mis-encoded) BuckleScript ‘string’ that's been manipulated as if it's a UTF-8 char-array, this function will decode (effectively, re-encode) that value into a functional, correct JavaScript (i.e. UCS-2) string.

Takes a String, containing a series of UTF-8 bytes encoded as Unicode codepoints (in JavaScript's standard UCS-2, that is); returns a standard JavaScript String with those Unicode scalars properly represented in UCS-2 code units, ready for standard JavaScript manipulation.
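The reverse trip can be sketched the same way. Again, fromFakeSketch is a hypothetical name and not the library's code; it assumes the standard TextDecoder API, and the RangeError for out-of-range code units is my own choice, not documented behaviour of the real function:

```typescript
// Hypothetical stand-in for fromFakeUTF8String; not the library's real code.
function fromFakeSketch(fake: string): string {
  const bytes = new Uint8Array(fake.length);
  for (let i = 0; i < fake.length; i++) {
    const unit = fake.charCodeAt(i);
    // A properly double-encoded string only contains values 0-255.
    if (unit > 0xff) throw new RangeError("code unit at index " + i + " is not a byte");
    bytes[i] = unit;
  }
  // Reinterpret the collected bytes as UTF-8.
  return new TextDecoder("utf-8").decode(bytes);
}

const restored = fromFakeSketch("foo\xC2\xB7bar");
console.log(restored, restored.length); // foo·bar 7
```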

A Note on Types

Given that readers of this are almost guaranteed to write OCaml, it will probably surprise nobody that I prefer the ability to use nominal types. This is not, however, standard TypeScript practice.

This library's TypeScript interface (which I hope I'm exporting correctly, by the way; I'm rather new to publishing a TypeScript-enabled library!) mints a new type for string_as_utf_8_buffer. Idiomatic usage would be to tag every stringish return-value from a BuckleScript module with this type:

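// string_as_utf_8_buffer is assumed to be exported alongside the two functions.
import { toFakeUTF8String, fromFakeUTF8String, string_as_utf_8_buffer } from 'ocaml-string-convert'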
import $AModule from './aModule.bs'

let $yuck = $AModule.returns_a_string() as string_as_utf_8_buffer
// ... manipulation ...
let str = fromFakeUTF8String($yuck)

(As you can see, I also like to follow a different naming-convention for values I know to contain opaque values produced by the BuckleScript runtime.)

You can, of course, dispense with my convention at your earliest convenience, if you can't stand the (hopefully helpful?) type-errors that this produces; I do not, of course, suggest that you do so:

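// string_as_utf_8_buffer is assumed to be exported alongside the two functions.
import { toFakeUTF8String, fromFakeUTF8String, string_as_utf_8_buffer } from 'ocaml-string-convert'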
import $AModule from './aModule.bs'

function from(str: string): string {
  // Cast away the nominal type at the boundary, then decode.
  return fromFakeUTF8String(str as string_as_utf_8_buffer)
}

let $yuck = $AModule.returns_a_string()
// ... manipulation ...
let str = from($yuck)
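If you haven't met nominal-ish types in TypeScript before, the usual trick is a ‘branded’ string. The sketch below shows the general technique only; FakeUtf8, tagFake and byteLength are hypothetical names, and I'm not claiming this is how string_as_utf_8_buffer is actually declared:

```typescript
// A brand: structurally a string, plus a phantom property that plain
// strings lack, so the compiler refuses to mix the two up.
type FakeUtf8 = string & { readonly __brand: "FakeUtf8" };

// The only way in is an explicit cast at a boundary you trust.
function tagFake(raw: string): FakeUtf8 {
  return raw as FakeUtf8;
}

// Functions that need double-encoded input can now say so in their type.
function byteLength(s: FakeUtf8): number {
  return s.length; // one code unit per UTF-8 byte
}

const tagged = tagFake("foo\xC2\xB7bar");
const n = byteLength(tagged); // 8

// byteLength("plain string") would not type-check: a bare string has no brand.
console.log(n);
```

The brand exists only at compile time; at runtime the value is an ordinary string, which is why the cast costs nothing.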
