Half-width Katakana should be representable in ISO-2022-JP

hsivonen opened this issue 7 years ago

A query string-based (and, therefore, IE/Edge-incompatible) test shows that Gecko, WebKit, Blink and Presto can encode half-width Katakana as ISO-2022-JP without NCRs.

The spec should be amended to match (both encoder and decoder side).

Anne van Kesteren · Answer 1 · Fri May 05 2017 18:07:16 GMT+0800 (China Standard Time)

It seems this is a special feature of the encoder only: "ﾐミ" both encode to 0x25 0x5F. I wonder if all Japanese encoders first convert halfwidth to fullwidth now.

vyv03354 · Answer 2 · Fri May 05 2017 18:11:35 GMT+0800 (China Standard Time)

"I wonder if all Japanese encoders first convert halfwidth to fullwidth now."

No, ISO-2022-JP only.

Anne van Kesteren · Answer 3 · Fri May 05 2017 18:29:58 GMT+0800 (China Standard Time)

You'd think someone would have already written and published an algorithm for this conversion. I guess I'll just find the mapping for each code point myself.

Anne van Kesteren · Answer 4 · Fri May 05 2017 18:42:43 GMT+0800 (China Standard Time)

Okay, so I guess what we want to do is to apply Unicode Normalization Form KC on any code point in the range U+FF61 to U+FF9F, inclusive.

Anne van Kesteren · Answer 5 · Fri May 05 2017 18:52:56 GMT+0800 (China Standard Time)

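// Walk the range U+FF61..U+FF9F, inclusive.
const start = 0xFF61;
const end = 0xFF9F + 1; // exclusive upper bound
for (let i = start; i < end; i++) {
  const cp = String.fromCodePoint(i);
  // NFKC folds each halfwidth code point to its compatibility equivalent.
  const fullwidthCP = cp.normalize("NFKC");
  // ...
}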

If I write those out and use @hsivonen's demo I get the results I was expecting per the above analysis.
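
As a rough check of the analysis above, here is a minimal sketch that folds a halfwidth code point with NFKC and then derives the JIS X 0208 bytes. The helper names `katakanaToJisX0208` and `encodeKatakana` are hypothetical, and the byte arithmetic assumes that JIS X 0208 row 5 holds U+30A1 to U+30F6 in Unicode order. It is not taken from any browser or from the spec.

```javascript
// Hypothetical helper for illustration only.
// Assumption: JIS X 0208 row 5 (lead byte 0x25) holds U+30A1..U+30F6 in
// Unicode order, so the trail byte is 0x21 plus the offset from U+30A1.
function katakanaToJisX0208(codePoint) {
  if (codePoint < 0x30A1 || codePoint > 0x30F6) {
    return null; // outside the katakana row; not handled by this sketch
  }
  return [0x25, 0x21 + (codePoint - 0x30A1)];
}

// Fold a single character with NFKC, then look up its two bytes.
function encodeKatakana(char) {
  const folded = char.normalize("NFKC");
  return katakanaToJisX0208(folded.codePointAt(0));
}

const hex = (bytes) =>
  bytes.map((b) => "0x" + b.toString(16).toUpperCase()).join(" ");

console.log(hex(encodeKatakana("\uFF90"))); // halfwidth MI -> 0x25 0x5F
console.log(hex(encodeKatakana("\u30DF"))); // fullwidth MI -> 0x25 0x5F
```

Both calls print the same two bytes, which matches the 0x25 0x5F observation earlier in the thread. Code points that NFKC maps outside the katakana row (punctuation, sound marks) return `null` here and would need separate handling.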

Addison Phillips · Answer 6 · Sat May 06 2017 00:45:15 GMT+0800 (China Standard Time)

Correct me if I'm wrong, but wouldn't halfwidth katakana involve a switch to JIS X 0201 (Katakana) mode? There's no need to destroy the round-trip by normalizing to fullwidth.

Anne van Kesteren · Answer 7 · Sat May 06 2017 00:55:59 GMT+0800 (China Standard Time)

@aphillips that is not what implementations do.
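
To make the two options concrete, here is a sketch of the byte sequences for a single U+FF90. The function names are hypothetical. The halfwidth form assumes the 7-bit JIS X 0201 Katakana layout in which U+FF61 sits at 0x21, and the folded form reuses the 0x25 0x5F bytes reported earlier in the thread.

```javascript
const ESC = 0x1B;

// Designation escape sequences, written as byte arrays.
const TO_ASCII = [ESC, 0x28, 0x42];             // ESC ( B
const TO_JISX0201_KATAKANA = [ESC, 0x28, 0x49]; // ESC ( I
const TO_JISX0208 = [ESC, 0x24, 0x42];          // ESC $ B

// Round-trippable form: stay halfwidth inside the JIS X 0201 Katakana set.
// Assumption: 7-bit layout where U+FF61 maps to 0x21.
function encodeHalfwidthAsIs(codePoint) {
  if (codePoint < 0xFF61 || codePoint > 0xFF9F) {
    throw new RangeError("not a halfwidth katakana code point");
  }
  return [...TO_JISX0201_KATAKANA, 0x21 + (codePoint - 0xFF61), ...TO_ASCII];
}

// Folded form: the fullwidth bytes reported in the thread for U+FF90.
function encodeFoldedMi() {
  return [...TO_JISX0208, 0x25, 0x5F, ...TO_ASCII];
}

const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(toHex(encodeHalfwidthAsIs(0xFF90))); // 1B 28 49 50 1B 28 42
console.log(toHex(encodeFoldedMi()));            // 1B 24 42 25 5F 1B 28 42
```

The first sequence preserves the halfwidth distinction on a round trip; the second loses it, which is the trade-off being debated here.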

Addison Phillips · Answer 8 · Sat May 06 2017 01:28:47 GMT+0800 (China Standard Time)

@annevk Yes, although that seems like a bug in the coders. I saw this thread this morning and was surprised, since I recall having to implement this when I was writing an ISO-2022 coder about 15 years ago. I can't imagine that the encoding's formal definition has changed, so I'm surprised to see implementations doing this.

Anne van Kesteren · Answer 9 · Sat May 06 2017 01:53:54 GMT+0800 (China Standard Time)

Sure, but after such a long time bugs become features.

Jungshik Shin · Answer 10 · Sun May 07 2017 06:10:25 GMT+0800 (China Standard Time)

@aphillips @annevk
I wouldn't call it a bug.

On the Internet/Web, only the original ISO-2022-JP defined in RFC 1468 was "widely" (relative to subsequent versions) used; the subsequent versions, ISO-2022-JP-[123], never got much traction. Why would anybody use ISO-2022-JP-* to encode Chinese, Korean, Latin beyond ASCII, and Greek? And JIS X 0212 (supported in ISO-2022-JP-1 or later) is not critical enough to Japanese users (Shift_JIS does not support it, either).

The original ISO-2022-JP does not support Halfwidth Katakana. That's why ICU has a fallback encoding for Halfwidth Katakana for the original ISO-2022-JP.

It's only ISO-2022-JP-3 that supports Halfwidth Katakana. ICU supports ISO-2022-JP-3 as defined and does not have fallback encoding for Halfwidth Katakana in ISO-2022-JP-3.
Note that ISO-2022-JP-2 defined in RFC 1554 does not support them either.

Anne van Kesteren · Answer 11 · Sun May 07 2017 12:43:17 GMT+0800 (China Standard Time)

Note that we do support decoding halfwidth Katakana: https://encoding.spec.whatwg.org/#iso-2022-jp-decoder-katakana. Should we remove that then?

Addison Phillips · Answer 12 · Sun May 07 2017 13:07:16 GMT+0800 (China Standard Time)

Why? If we see the byte sequence and it isn't invalid, why not decode it?

Note that this encoding is primarily used for email, not web pages.

Anne van Kesteren · Answer 13 · Sun May 07 2017 13:37:34 GMT+0800 (China Standard Time)

Sorry, that suggestion was rather flippant and I should have looked at https://w3c-test.org/encoding/iso-2022-jp-decoder.html in various browsers first, which shows it's supported (though not sure about Edge).

It just shows that @jungshik's story above is not really complete as browsers support ISO-2022-JP-3's halfwidth Katakana extension on the decoder side (in what they call ISO-2022-JP).

To be 100% clear: suggestion retracted.
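
For readers who want to see what decoder-side support amounts to, here is a deliberately tiny sketch that understands only the ASCII state and the halfwidth Katakana state entered by ESC ( I. The function name is hypothetical, the mapping assumes that bytes 0x21 to 0x5F correspond to U+FF61 onward, and everything else (JIS X 0208, Roman, error recovery details) is left out.

```javascript
// Simplified illustration; a real ISO-2022-JP decoder has more states.
// Assumption: in the Katakana state, byte b in 0x21..0x5F maps to
// U+FF61 + (b - 0x21).
function decodeAsciiAndHalfwidthKatakana(bytes) {
  let inKatakana = false;
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    const isEscape =
      b === 0x1B &&
      bytes[i + 1] === 0x28 &&
      (bytes[i + 2] === 0x49 || bytes[i + 2] === 0x42);
    if (isEscape) {
      inKatakana = bytes[i + 2] === 0x49; // ESC ( I enters, ESC ( B leaves
      i += 2;                             // skip the rest of the escape
    } else if (inKatakana && b >= 0x21 && b <= 0x5F) {
      out += String.fromCodePoint(0xFF61 + (b - 0x21));
    } else if (!inKatakana && b <= 0x7F && b !== 0x1B) {
      out += String.fromCodePoint(b);
    } else {
      out += "\uFFFD"; // anything this sketch does not understand
    }
  }
  return out;
}

const sample = [0x41, 0x1B, 0x28, 0x49, 0x50, 0x1B, 0x28, 0x42, 0x42];
console.log(decodeAsciiAndHalfwidthKatakana(sample) === "A\uFF90B"); // true
```

Note the asymmetry this illustrates: the byte 0x50 after ESC ( I decodes to halfwidth U+FF90, while the encoders discussed above would emit the fullwidth bytes for that same code point.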

Jungshik Shin · Answer 14 · Mon May 08 2017 00:24:13 GMT+0800 (China Standard Time)

Sorry for the confusion. It turns out that ICU's ISO-2022-JP converter (and other converters used in browsers) supports Halfwidth Katakana ("ESC ( I") in the spirit of 'be lenient in what you accept and be strict in what you emit'. For instance, this is explicitly commented in ICU's ucnv2022.cpp:

 Note: The converter uses some leniency:
 - The escape sequence ESC ( I for half-width 7-bit Katakana is recognized in
    all versions, not just JIS7 and JIS8.
.....
static const uint16_t jpCharsetMasks[MAX_JA_VERSION+1]={
    CSM(ASCII)|CSM(JISX201)|CSM(JISX208)|CSM(HWKANA_7BIT),  // <== ISO-2022-JP version 0 still has HWKANA_7BIT