zenkaku-string

String functions for handling East Asian wide characters in terminals

In CJK (Chinese, Japanese and Korean) text, "wide" or "fullwidth" characters (zenkaku in Japanese) are Unicode characters that are rendered two cells wide instead of one in a fixed-width font. Examples include the Japanese kana (あいうえお), fullwidth romaji (ＡＢＣＤＥ), and kanji/hanzi ideographs.

Since these characters occupy two cells but count as one toward a string's length, the length no longer matches the displayed width in a fixed-width text environment such as a terminal: a string containing one fullwidth character appears one cell wider than its length value indicates. This breaks, for example, tabulated layouts.

To work around this, the functions in this library treat wide characters in strings as though they have a length of 2.
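As a rough illustration of the idea (not the library's actual implementation), the sketch below counts two for every character that matches a small, hand-picked subset of wide ranges. The names `WIDE_RE` and `sketchWideLength` are hypothetical, and the ranges are assumptions chosen for the example; the library's own list is more complete.

```javascript
// Assumption: a small subset of wide/fullwidth ranges, for illustration only.
// The library's own range list (characters.js) is more complete.
const WIDE_RE = /[\u1100-\u115F\u3000-\u30FF\u4E00-\u9FFF\uAC00-\uD7A3\uFF01-\uFF60\uFFE0-\uFFE6]/

// Hypothetical helper: counts terminal cells instead of UTF-16 code units.
function sketchWideLength(str) {
  let cells = 0
  for (const char of str) {
    // A wide character occupies two cells, anything else one.
    cells += WIDE_RE.test(char) ? 2 : 1
  }
  return cells
}

console.log('a牧場b'.length)            // 4
console.log(sketchWideLength('a牧場b')) // 6
```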

For a full list of the character ranges that are deemed to be "wide", see the characters.js source.

⚠️ Note: this library is not ready to be used yet. ⚠️

Usage

The library is available as zenkaku-string on npm:

npm install --save zenkaku-string

It provides the following exports:

Function name	Exact	Replacement for
`wideCharAt(str, idx[, padChar])`	No†	`"str".charAt(idx)`
`wideIndexOf(str, searchVal[, beginIdx])`	Yes	`"str".indexOf(searchVal[, beginIdx])`
`wideLastIndexOf(str, searchVal[, beginIdx])`	Yes	`"str".lastIndexOf(searchVal[, beginIdx])`
`wideLength(str)`	Yes	`"str".length`
`widePadEnd(str, targetLength[, padChar])`	Yes	`"str".padEnd(targetLength[, padChar])`
`widePadStart(str, targetLength[, padChar])`	Yes	`"str".padStart(targetLength[, padChar])`
`wideSlice(str, beginIdx[, endIdx[, padChar]])`	No‡	`"str".slice(beginIdx[, endIdx])`
`wideSubstr(str, beginIdx[, length[, padChar]])`	No‡	`"str".substr(beginIdx[, length])`
`wideSubstring(str, beginIdx[, endIdx[, padChar]])`	No‡	`"str".substring(beginIdx[, endIdx])`

Export name	Description
`charRangeRe`	`RegExp` object used to match W/FW characters
`charRange.wide`	Array of strings representing all wide character ranges
`charRange.fullWidth`	Array of strings representing all fullwidth character ranges

†: returns a single padding character if the index falls on the second half of a wide character; see "Ambiguity (string padding)" below.
‡: pads the start or end of the result with a single padding character if the slice includes only half of a wide character at that position.

Ambiguity (string padding)

In principle, the functions have the same interface as the ones they replace and behave exactly the same. There is one important exception: since individual wide characters count for two, it's possible to "slice them in half".

First, some well-defined examples:

const { wideLength, wideSlice } = require('zenkaku-string')

// For visualization purposes, as the inner two kanji are wide characters,
// this string will be represented as "a1122b" where "11" and "22" are our kanji.

const farm = 'a牧場b' // a1122b

                                    // [    ]
console.log(wideLength(farm))       // a1122b → 6 (as "牧" and "場" count for 2)

                                    // [ ]---
console.log(wideSlice(farm, 0, 3))  // a1122b → "a牧"

                                    // ---[ ]
console.log(wideSlice(farm, 3, 6))  // a1122b → "場b"

                                    // -[  ]-
console.log(wideSlice(farm, 1, 5))  // a1122b → "牧場"

The following examples are problematic, however:

                                    // [  ]--
console.log(wideSlice(farm, 0, 4))  // a1122b → "a牧 " (!)

                                    // --[  ]
console.log(wideSlice(farm, 2, 6))  // a1122b → " 場b" (!)

In these last two examples we're slicing a kanji character down the middle, and we can't return half a character.

Since this library always aims to return a string of a predictable length, it replaces half characters with a padding character. The default padding character is a single space (U+0020), but a different one can be passed as the last argument to each function.

Note that the padding character is always used as-is: no extra length calculation is done on it, so if a wide character is used the string will look longer than if a non-wide character were used. It's expected that a non-wide character is used for padding, even for widePadStart() and widePadEnd().
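The padding behavior described above can be sketched as follows. This is a minimal illustration and not the library's code: `WIDE_RE` and `sketchWideSlice` are hypothetical names, the character ranges are an assumed subset, and negative indices are simply handled the way `Array.prototype.slice` handles them.

```javascript
// Assumption: illustrative subset of wide ranges, not the library's list.
const WIDE_RE = /[\u3000-\u30FF\u4E00-\u9FFF\uFF01-\uFF60]/

// Hypothetical helper mimicking the documented wideSlice() behavior.
function sketchWideSlice(str, beginIdx, endIdx, padChar = ' ') {
  // Expand the string into one entry per terminal cell. A wide character
  // fills two cells: its first half (1) and its second half (2).
  const cells = []
  for (const char of str) {
    if (WIDE_RE.test(char)) {
      cells.push({ char, half: 1 }, { char, half: 2 })
    } else {
      cells.push({ char, half: 0 })
    }
  }

  const picked = cells.slice(beginIdx, endIdx)
  let out = ''
  for (let i = 0; i < picked.length; i++) {
    const cell = picked[i]
    if (cell.half === 0) {
      out += cell.char
    } else if (cell.half === 1 && i + 1 < picked.length) {
      // Both halves are inside the slice: emit the character once.
      out += cell.char
      i++ // skip the second half
    } else {
      // A lone half at the start or end of the slice becomes padding.
      out += padChar
    }
  }
  return out
}

const farm = 'a牧場b'
console.log(JSON.stringify(sketchWideSlice(farm, 0, 3)))      // "a牧"
console.log(JSON.stringify(sketchWideSlice(farm, 0, 4)))      // "a牧 "
console.log(JSON.stringify(sketchWideSlice(farm, 2, 6)))      // " 場b"
console.log(JSON.stringify(sketchWideSlice(farm, 0, 4, '.'))) // "a牧."
```

In this sketch every result spans exactly `endIdx - beginIdx` cells as long as the padding character is not itself wide, which is why a non-wide padding character is expected.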

Matching wide characters

If you need to process a string's wide characters in some other way, you can import the regex used to match them:

const { charRangeRe } = require('zenkaku-string')

console.log(charRangeRe instanceof RegExp)  // true

The charRangeRe RegExp object is structured like new RegExp('[\u1100-\u11F9\u3000-\u303F .. etc. \uFFE0-\uFFE6]') and has no flags set.

If you need to match characters globally or use other flags, construct a new RegExp object:

const charRangeReGlobal = new RegExp(charRangeRe, 'g')
charRangeReGlobal.lastIndex = 0  // remember that these are stateful!
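To show what the global copy is useful for, and why the `lastIndex` reminder matters, here is a small self-contained sketch. The `standInRe` pattern is an assumption standing in for the exported `charRangeRe` and covers only a few ranges; the rest uses standard `RegExp` and `String` methods.

```javascript
// Assumption: stand-in for the library's charRangeRe export (subset of ranges).
const standInRe = /[\u3000-\u30FF\u4E00-\u9FFF\uFF01-\uFF60]/

// Copy the pattern into a new RegExp with the global flag set.
const globalRe = new RegExp(standInRe, 'g')

const text = 'a牧場b'

// match() with a global regex returns every match (or null if none).
const wideChars = text.match(globalRe) || []
console.log(wideChars) // [ '牧', '場' ]

// replace() with a global regex visits every match.
const marked = text.replace(globalRe, (ch) => `[${ch}]`)
console.log(marked) // a[牧][場]b

// test() on a global regex advances lastIndex between calls.
globalRe.lastIndex = 0
console.log(globalRe.test('牧')) // true
console.log(globalRe.test('牧')) // false, the search resumed after the match
```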

For even lower-level access, the actual W/FW character ranges used to construct this regex are also exported as the charRange object.

Examples

See examples/table.js for an example script showing how this library can be used to display tables containing Latin characters and Japanese fullwidth characters. It's meant to run inside a terminal.
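For a sense of the alignment problem such a table script has to solve, the following stand-alone sketch pads mixed Latin and Japanese cells to a common display width. It does not use the library: `WIDE_RE`, `sketchWideLength` and `sketchWidePadEnd` are hypothetical helpers, the ranges are an assumed subset, and the padding character is assumed to be a single non-wide character.

```javascript
// Assumption: illustrative subset of wide ranges, not the library's list.
const WIDE_RE = /[\u3000-\u30FF\u4E00-\u9FFF\uFF01-\uFF60]/

// Hypothetical helper: display width in terminal cells.
function sketchWideLength(str) {
  let cells = 0
  for (const char of str) cells += WIDE_RE.test(char) ? 2 : 1
  return cells
}

// Hypothetical helper: pad to a display width rather than to a string length.
// Assumes padChar is a single non-wide character.
function sketchWidePadEnd(str, targetLength, padChar = ' ') {
  const missing = Math.max(0, targetLength - sketchWideLength(str))
  return str + padChar.repeat(missing)
}

const rows = [
  ['Name', 'City'],
  ['田中', '東京'],
  ['Smith', 'Osaka']
]

// Every cell is padded to 8 cells, so the separators line up in a terminal.
const lines = rows.map((row) =>
  row.map((cell) => sketchWidePadEnd(cell, 8)).join('|')
)
console.log(lines.join('\n'))
```

With the native `padEnd()` the second row would come out two cells too wide per column, because each kanji counts as one toward `length` but occupies two cells on screen.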

TODO: add simple code example.

Related libraries

All other Zenkaku projects use the string manipulation functions provided by this library, and are designed for use in building terminal applications.

zenkaku-wrap - Line wrapping with CJK support
zenkaku-table - Table generation with CJK support

Sources

Unicode Standard Annex #11 - Report on East Asian Width property
Unicode Technical Report #11 (contains a full list of W/FW character ranges)
East Asian line breaking rules - Article from Wikipedia

msikma / zenkaku-string