Shouldn't the codes 0xC0 and 0xC1 be of length 0?

Question


ambarj2009 opened this issue 3 years ago

JESÚS CHICA PRIETO commented 3 years ago

Hi guys.

In UTF-8, the byte values 0xC0 and 0xC1 are never used. As lead bytes of a 2-byte sequence they could only represent a Basic Latin code point, that is, one of the ASCII codes, whereas UTF-8 is designed to be compatible with ASCII and encodes those code points in a single byte.

However, in the utf8proc_utf8class table, these values are assigned a length of 2 bytes when, to my understanding, they should have length 0.

So:

What is the motivation for giving 0xC0 and 0xC1 a length of 2?

Thanks.

Steven G. Johnson · Answer 1 · Fri Dec 17 2021 10:00:26 GMT+0800 (China Standard Time)

No, the table is correct. U+00C0 is a valid Unicode codepoint (the character À), and is encoded in UTF-8 as two bytes: 0xc3, 0x80.
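
For readers following along, here is a small Python sketch that separates the two things "0xC0" can mean in this thread: the code point U+00C0 and the raw byte 0xC0. It uses only Python's built-in UTF-8 codec, not utf8proc, and the helper names `decode_two_byte` and `strict_decode_fails` are made up for illustration. It checks the encoding quoted above and shows, from the bit layout alone, which code points a 2-byte sequence starting with 0xC0 or 0xC1 could reach.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """Combine a 2-byte sequence by bit layout only (110xxxxx 10yyyyyy).

    Hypothetical helper: it performs no validity checks at all.
    """
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def strict_decode_fails(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder rejects the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# The code point U+00C0 (À) is encoded as the two bytes 0xC3 0x80.
encoded = "\u00c0".encode("utf-8")
print([hex(b) for b in encoded])  # ['0xc3', '0x80']

# As lead bytes, 0xC0 and 0xC1 can only reach U+0000..U+007F,
# which is the range already covered by single-byte ASCII.
lowest = decode_two_byte(0xC0, 0x80)
highest = decode_two_byte(0xC1, 0xBF)
print(hex(lowest), hex(highest))  # 0x0 0x7f

# Python's strict decoder rejects the byte 0xC0 as a lead byte,
# while accepting the real encoding of U+00C0.
print(strict_decode_fails(bytes([0xC0, 0x80])))  # True
print(strict_decode_fails(bytes([0xC3, 0x80])))  # False
```

The sketch says nothing about how utf8proc itself treats these bytes when decoding; it only illustrates the difference between a code point value and a byte value.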