w3c-ccg / rdf-dataset-canonicalization

RDF Dataset Canonicalization

Home Page: https://www.w3.org/TR/rdf-canon/

Lexicographic order of N-Quads

clehner opened this issue · comments

Lexicographic sorting is applied not just to hashes (#14) but also to N-Quads, for example in Hash First Degree Quads:
https://github.com/json-ld/rdf-dataset-canonicalization/blob/5661d1aa69532489290bc90e388f54608cd55465/spec/index.html#L644

N-Quads are specified as Unicode encoded as UTF-8:

Therefore I would assume that N-Quads should be sorted as UTF-8 or Unicode strings.

UTF-8 has the property that bytewise comparisons of UTF-8 strings are equivalent to codepoint comparisons of the corresponding string of Unicode codepoints, i.e. code point (character) order is the same as binary/byte order:

  • https://en.wikipedia.org/wiki/UTF-8#Comparison_with_other_encodings

    Sorting order: The chosen values of the leading bytes means that a list of UTF-8 strings can be sorted in code point order by sorting the corresponding byte sequences.

  • https://stackoverflow.com/a/4611330

    the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you will get the same result as if you first decoded the string into Unicode and compared the numeric values of each code point.
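The property described in the two quotes above can be checked with a minimal sketch. The sample strings are hypothetical and were chosen only to cover 1- to 4-byte UTF-8 sequences; the sketch relies on Python comparing `str` values by code point and `bytes` values byte by byte.

```python
# Hypothetical sample strings covering 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["z", "\U0001f600", "\u00e9", "a", "\uff5e", "\ue000"]


def sort_by_code_point(strings):
    # Python compares str values code point by code point.
    return sorted(strings)


def sort_by_utf8_bytes(strings):
    # Python compares bytes values as sequences of unsigned bytes.
    return sorted(strings, key=lambda s: s.encode("utf-8"))


# Both orderings agree for these samples.
assert sort_by_code_point(samples) == sort_by_utf8_bytes(samples)
```

This only demonstrates the equivalence for a handful of inputs; it is not a proof, and it says nothing about which ordering the specification intends.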

However, string comparisons in ECMAScript/JavaScript are not done in UTF-8/Unicode code point order, but by UTF-16 (or UCS-2) code units:

JSON Canonicalization Scheme (JCS) (RFC 8785) also encodes object properties as UTF-16 for sorting: https://datatracker.ietf.org/doc/html/rfc8785#section-3.2.3 (although it suggests using UTF-8 as a final serialization step for cryptographic purposes: https://datatracker.ietf.org/doc/html/rfc8785#section-3.2.4)

It looks like Java strings use UTF-16 as well, although it has functions based on Unicode code points; it's not clear to me if string comparisons would be using UTF-16 or Unicode code points...
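To illustrate why the distinction matters, here is a small sketch that emulates UTF-16 code unit ordering by comparing big-endian UTF-16 byte sequences. The two N-Quads and their `http://example.org/p` predicate are made up for illustration, and the emulation is an assumption about how a code-unit-based comparison behaves, not a statement about any particular implementation.

```python
# Two hypothetical N-Quads that differ only in one literal character:
# U+FF5E lies in the BMP above the surrogate range; U+1F600 needs a surrogate pair.
quad_bmp = '_:b0 <http://example.org/p> "\uff5e" .'
quad_astral = '_:b0 <http://example.org/p> "\U0001f600" .'


def sort_by_code_point(quads):
    # Python str comparison is by Unicode code point (same order as UTF-8 bytes).
    return sorted(quads)


def sort_by_utf16_code_units(quads):
    # Big-endian UTF-16 bytes compare in the same order as 16-bit code units.
    return sorted(quads, key=lambda q: q.encode("utf-16-be"))


# Code point order: U+FF5E < U+1F600.
assert sort_by_code_point([quad_astral, quad_bmp]) == [quad_bmp, quad_astral]
# Code unit order: 0xD83D (high surrogate of U+1F600) < 0xFF5E, so the order flips.
assert sort_by_utf16_code_units([quad_bmp, quad_astral]) == [quad_astral, quad_bmp]
```

The two orderings only diverge when a supplementary character is compared against a BMP character in the range U+E000 to U+FFFF, so the discrepancy is easy to miss in tests that use mostly ASCII data.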

Should it be noted that URDNA2015 uses UTF-8 in its lexicographical sorting, if this is the case?

Also, FWIW, the Unicode Standard mentions lexicographic ordering in one place: https://www.unicode.org/versions/Unicode13.0.0/ch02.pdf

For these reasons, design of the Unicode Standard is not specific to the design of particular basic text-processing algorithms. Instead, it provides an encoding that can be used with a wide variety of algorithms. In particular, sorting and string comparison algorithms cannot assume that the assignment of Unicode character code numbers provides an alphabetical ordering for lexicographic string comparison. Culturally expected sorting orders require arbitrarily complex sorting algorithms. The expected sort sequence for the same characters differs across languages; thus, in general, no single acceptable lexicographic ordering exists. See Unicode Technical Standard #10, “Unicode Collation Algorithm,” for the standard default mechanism for comparing Unicode strings.

But I'm pretty sure we don't mean to sort using the Unicode Collation Algorithm, if only because it is not fixed/stable: https://www.unicode.org/reports/tr10/#Common_Misperceptions