whatwg / encoding

Encoding Standard

Home Page:https://encoding.spec.whatwg.org/

Half-width Katakana should be representable in ISO-2022-JP

hsivonen opened this issue

A query string-based (and, therefore, IE/Edge-incompatible) test shows that Gecko, WebKit, Blink and Presto can encode half-width Katakana as ISO-2022-JP without NCRs.

The spec should be amended to match (both encoder and decoder side).

It seems this is a special feature of the encoder only: halfwidth "ﾐ" (U+FF90) and fullwidth "ミ" (U+30DF) both encode to 0x25 0x5F. I wonder if all Japanese encoders first convert halfwidth to fullwidth now.

> I wonder if all Japanese encoders first convert halfwidth to fullwidth now.

No, ISO-2022-JP only.

You'd think someone would have already written and published an algorithm for this conversion. I guess I'll just find the mapping for each code point myself.

Okay, so I guess what we want to do is to apply Unicode Normalization Form KC (NFKC) to any code point in the range U+FF61 to U+FF9F, inclusive.

// Halfwidth punctuation and Katakana: U+FF61 to U+FF9F, inclusive.
const start = 0xFF61,
      end = 0xFF9F + 1;
for (let i = start; i < end; i++) {
  const cp = String.fromCodePoint(i),
        fullwidthCP = cp.normalize("NFKC"); // fullwidth equivalent of cp
  // ...
}

If I write those out and use @hsivonen's demo I get the results I was expecting per the above analysis.
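As a rough illustration of the normalization idea above (a sketch, not the spec's algorithm), the snippet below normalizes one code point at a time and derives the JIS X 0208 bytes for Katakana letters. The helper names `toFullwidth`, `jisX0208KatakanaBytes` and `hexBytes` are made up for this example, and the byte arithmetic assumes that JIS X 0208 row 5 (lead byte 0x25) lists Katakana in the same order as U+30A1 to U+30F6. It also shows one wrinkle: normalized individually, U+FF9E and U+FF9F become the combining marks U+3099 and U+309A, so they would probably need explicit handling.

```javascript
// Hypothetical helper: NFKC-normalize a single code point in U+FF61..U+FF9F
// and return the resulting code point; other code points pass through.
function toFullwidth(codePoint) {
  if (codePoint < 0xFF61 || codePoint > 0xFF9F) return codePoint;
  return String.fromCodePoint(codePoint).normalize("NFKC").codePointAt(0);
}

// Hypothetical helper: two-byte JIS X 0208 code for a Katakana letter,
// assuming row 5 follows the order U+30A1..U+30F6.
function jisX0208KatakanaBytes(codePoint) {
  const cp = toFullwidth(codePoint);
  if (cp < 0x30A1 || cp > 0x30F6) return null; // not covered by this sketch
  return [0x25, 0x21 + (cp - 0x30A1)];
}

// Format a byte array as "0xNN 0xNN".
const hexBytes = (bytes) =>
  bytes.map((b) => "0x" + b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hexBytes(jisX0208KatakanaBytes(0xFF90))); // halfwidth MI: 0x25 0x5F
console.log(hexBytes(jisX0208KatakanaBytes(0x30DF))); // fullwidth MI: 0x25 0x5F

// The halfwidth voiced sound mark normalizes to a combining mark.
console.log(toFullwidth(0xFF9E).toString(16)); // 3099
```

Normalizing a whole string instead of single code points would additionally compose sequences such as U+FF76 U+FF9E into U+30AC, which may or may not be what an encoder wants.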

Correct me if I'm wrong, but wouldn't halfwidth katakana involve a switch to JIS X 0201 (Roman) mode? There's no need to destroy the round-trip by normalizing to fullwidth.
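For comparison, here is a sketch of the byte streams the two strategies would produce for a single halfwidth ﾐ (U+FF90). It assumes the escape sequences ESC ( I for 7-bit halfwidth Katakana (the sequence quoted from ICU further down this thread), ESC $ B for JIS X 0208, and ESC ( B to return to ASCII. The function names are hypothetical, and neither function is a complete encoder.

```javascript
const ESC = 0x1B;

// Strategy A (keeps the round trip): switch to the halfwidth Katakana set
// with ESC ( I and emit one 7-bit byte per code point in U+FF61..U+FF9F.
function encodeAsHalfwidth(codePoint) {
  if (codePoint < 0xFF61 || codePoint > 0xFF9F) {
    throw new RangeError("not a halfwidth code point");
  }
  return [ESC, 0x28, 0x49, codePoint - 0xFF61 + 0x21, ESC, 0x28, 0x42];
}

// Strategy B (what the tested browsers appear to do): emit the fullwidth
// equivalent from JIS X 0208 after ESC $ B. Only Katakana letters
// (U+30A1..U+30F6 after normalization) are handled in this sketch.
function encodeAsFullwidth(codePoint) {
  const full = String.fromCodePoint(codePoint).normalize("NFKC").codePointAt(0);
  if (full < 0x30A1 || full > 0x30F6) {
    throw new RangeError("not handled in this sketch");
  }
  return [ESC, 0x24, 0x42, 0x25, 0x21 + (full - 0x30A1), ESC, 0x28, 0x42];
}
```

For U+FF90, strategy A yields 1B 28 49 50 1B 28 42 and strategy B yields 1B 24 42 25 5F 1B 28 42. Only the first decodes back to the halfwidth character.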

@aphillips that is not what implementations do.

@annevk Yes, although that seems like a bug in the coders. I saw this thread this morning and was surprised, since I recall having to implement this when I was writing an ISO-2022 coder about 15 years ago. I can't imagine that the encoding's formal definition has changed, so I'm surprised to see implementations doing this.

Sure, but after such a long time bugs become features.

@aphillips @annevk
I wouldn't call it a bug.

On the Internet/Web, only the original ISO-2022-JP defined in RFC 1468 was "widely" used (relative to subsequent versions); the subsequent versions, ISO-2022-JP-[123], never got much traction. Why would anybody use ISO-2022-JP-* to encode Chinese, Korean, Latin beyond ASCII, and Greek? And JIS X 0212 (supported in ISO-2022-JP-1 or later) is not critical enough to Japanese users (Shift_JIS does not support it, either).

The original ISO-2022-JP does not support Halfwidth Katakana. That's why ICU has a fallback encoding for Halfwidth Katakana for the original ISO-2022-JP.

It's only ISO-2022-JP-3 that supports Halfwidth Katakana. ICU supports ISO-2022-JP-3 as defined and does not have fallback encoding for Halfwidth Katakana in ISO-2022-JP-3.
Note that ISO-2022-JP-2 defined in RFC 1554 does not support them either.

Note that we do support decoding halfwidth Katakana: https://encoding.spec.whatwg.org/#iso-2022-jp-decoder-katakana. Should we remove that then?

Why? If we see the byte sequence and it isn't invalid, why not decode it?

Note that this encoding is primarily used for email, not web pages.

Sorry, that suggestion was rather flippant and I should have looked at https://w3c-test.org/encoding/iso-2022-jp-decoder.html in various browsers first, which shows it's supported (though not sure about Edge).

It just shows that @jungshik's story above is not really complete as browsers support ISO-2022-JP-3's halfwidth Katakana extension on the decoder side (in what they call ISO-2022-JP).
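A minimal sketch of what that decoder-side support amounts to, handling only the ASCII and Katakana states. The Katakana rule (bytes 0x21 to 0x5F map to code point 0xFF61 - 0x21 + byte) follows the decoder section linked above. The function name `decodeKatakanaSketch`, the restriction to two escape sequences, and throwing instead of emitting U+FFFD are simplifications made for this example.

```javascript
// Hypothetical, heavily simplified decoder: recognizes only ESC ( B (ASCII)
// and ESC ( I (halfwidth Katakana), and throws on anything else.
function decodeKatakanaSketch(bytes) {
  let state = "ascii";
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    if (b === 0x1B) {
      const b1 = bytes[i + 1];
      const b2 = bytes[i + 2];
      if (b1 === 0x28 && b2 === 0x42) {
        state = "ascii"; // ESC ( B
      } else if (b1 === 0x28 && b2 === 0x49) {
        state = "katakana"; // ESC ( I
      } else {
        throw new Error("escape sequence not supported in this sketch");
      }
      i += 2; // skip the two bytes that followed ESC
    } else if (state === "katakana" && b >= 0x21 && b <= 0x5F) {
      out += String.fromCodePoint(0xFF61 - 0x21 + b);
    } else if (state === "ascii" && b <= 0x7F) {
      out += String.fromCodePoint(b);
    } else {
      throw new Error("invalid byte for the current state");
    }
  }
  return out;
}

// "A", then halfwidth MI (U+FF90) in the Katakana state, then "B".
const decoded = decodeKatakanaSketch([0x41, 0x1B, 0x28, 0x49, 0x50, 0x1B, 0x28, 0x42, 0x42]);
console.log(decoded === "A\uFF90B"); // true
```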

To be 100% clear: suggestion retracted.

Sorry for the confusion. It turned out that ICU's ISO-2022-JP converter (and other converters used in browsers) supports Halfwidth Katakana ("ESC ( I") in the spirit of 'be lenient in what you accept and be strict in what you emit'. For instance, this is explicitly noted in a comment in ICU's ucnv2022.cpp:

 Note: The converter uses some leniency:
 - The escape sequence ESC ( I for half-width 7-bit Katakana is recognized in
    all versions, not just JIS7 and JIS8.
.....
static const uint16_t jpCharsetMasks[MAX_JA_VERSION+1]={
    CSM(ASCII)|CSM(JISX201)|CSM(JISX208)|CSM(HWKANA_7BIT),  <== ISO-2022-JP version 0 still has HWKANA_7BIT