JuliaStrings / utf8proc

a clean C library for processing UTF-8 Unicode data

Home Page: http://juliastrings.github.io/utf8proc/

Shouldn't the codes 0xC0 and 0xC1 be of length 0?

ambarj2009 opened this issue

Hi guys.

In UTF-8, the byte values 0xC0 and 0xC1 are never used. As the lead byte of a two-byte sequence, either one could only encode a code point in the Basic Latin range, that is, an ASCII character, and UTF-8 already encodes those as single bytes in order to stay compatible with ASCII.

However, in the utf8proc_utf8class table these values are assigned a length of 2 bytes, whereas, as I understand it, they should have length 0.
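To make the point about 0xC0 and 0xC1 concrete, here is a minimal sketch in C. It is not taken from utf8proc; `sketch_decode2` and `sketch_is_overlong2` are hypothetical helper names, and the decoding is deliberately naive (it does not check that the second byte is a valid continuation byte).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8proc: combine the payload bits of
   a two-byte sequence without any validation. */
uint32_t sketch_decode2(unsigned char lead, unsigned char cont)
{
    /* 110xxxxx 10yyyyyy -> xxxxxyyyyyy */
    return ((uint32_t)(lead & 0x1F) << 6) | (uint32_t)(cont & 0x3F);
}

/* Hypothetical helper: a two-byte sequence is overlong when the value it
   carries is below 0x80 and would therefore fit in a single byte. */
int sketch_is_overlong2(unsigned char lead, unsigned char cont)
{
    assert((lead & 0xE0) == 0xC0); /* caller must pass a two-byte lead byte */
    return sketch_decode2(lead, cont) < 0x80;
}
```

With lead byte 0xC0 the five payload bits are 00000, and with 0xC1 they are 00001, so the combined value can never exceed 0x7F. For example, 0xC1 0x81 would carry the value 0x41, which is the ASCII letter A.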

So:

What is the reason for 0xC0 and 0xC1 having length 2?

Thanks.

No, the table is correct. U+00C0 is a valid Unicode codepoint (the character À), and is encoded in UTF-8 as two bytes: 0xc3, 0x80.
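The two-byte encoding of U+00C0 quoted above can be checked with a small sketch in C. This is an illustration of the standard UTF-8 bit layout, not utf8proc's own encoder; `sketch_encode_utf8` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8proc: encode one code point as
   UTF-8 into out (which must have room for 4 bytes). Returns the number
   of bytes written, or 0 if cp is a surrogate or above U+10FFFF. */
size_t sketch_encode_utf8(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For cp = 0xC0 the second branch applies: 0xC0 >> 6 is 3, giving the lead byte 0xC3, and the low six bits are all zero, giving the continuation byte 0x80.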