Lexicographic order of N-Quads

Question

clehner opened this issue 2 years ago

Lexicographic sorting is applied not only to hashes (#14) but also to N-Quads, for example in Hash First Degree Quads:
https://github.com/json-ld/rdf-dataset-canonicalization/blob/5661d1aa69532489290bc90e388f54608cd55465/spec/index.html#L644

N-Quads are specified as Unicode encoded as UTF-8:

https://www.w3.org/TR/n-quads/#sec-mediatype

The content encoding of N-Quads is always UTF-8
https://www.w3.org/TR/n-quads/#sec-grammar

An N-Quads document is a Unicode[UNICODE] character string encoded in UTF-8

Therefore I would assume that N-Quads should be sorted as UTF-8 byte strings or, equivalently, as sequences of Unicode code points.

UTF-8 has the property that a bytewise comparison of two UTF-8 strings gives the same result as comparing the corresponding sequences of Unicode code points, i.e. code point order is the same as byte order:

https://en.wikipedia.org/wiki/UTF-8#Comparison_with_other_encodings

Sorting order: The chosen values of the leading bytes means that a list of UTF-8 strings can be sorted in code point order by sorting the corresponding byte sequences.
https://stackoverflow.com/a/4611330

the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you will get the same result as if you first decoded the string into Unicode and compared the numeric values of each code point.
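
To make the quoted property concrete, here is a minimal Python sketch that checks it empirically on random strings. It is only an illustration, not part of any spec; the helper name `random_string` and the sample size are my own assumptions. Python compares `str` values by code point and `bytes` values byte by byte, so the two sorts should agree.

```python
import random


def random_string(rng, max_len=6):
    """Build a random string of Unicode scalar values (hypothetical helper)."""
    chars = []
    for _ in range(rng.randint(0, max_len)):
        cp = rng.randint(0, 0x10FFFF)
        # Surrogate code points cannot be encoded in UTF-8, so redraw them.
        while 0xD800 <= cp <= 0xDFFF:
            cp = rng.randint(0, 0x10FFFF)
        chars.append(chr(cp))
    return "".join(chars)


rng = random.Random(0)
samples = [random_string(rng) for _ in range(1000)]

# str comparison in Python is code point by code point.
by_code_point = sorted(samples)

# bytes comparison is byte by byte, treating each byte as an unsigned integer.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

# The two orderings coincide, as the quotes above claim.
assert by_code_point == by_utf8_bytes
```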

However, string comparisons in ECMAScript/JavaScript are not based on UTF-8 bytes or Unicode code points, but on UTF-16 (or UCS-2) code units:

https://mathiasbynens.be/notes/javascript-encoding

JavaScript engines are allowed to use either UCS-2 or UTF-16
[...] The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
https://262.ecma-international.org/5.1/#sec-8.4

All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers
https://262.ecma-international.org/5.1/#sec-4.3.16

Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text
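
A small Python sketch of where the two orderings diverge. It emulates comparison by 16-bit code units rather than running JavaScript, and the helper `utf16_code_units` is a name I made up for illustration. The difference only shows up when a character in U+E000..U+FFFF is compared with a character outside the Basic Multilingual Plane, because the latter is encoded as a surrogate pair starting with a unit in 0xD800..0xDBFF.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of 16-bit integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


a = "\uFB01"      # U+FB01, a single code unit: [0xFB01]
b = "\U0001F600"  # U+1F600, a surrogate pair: [0xD83D, 0xDE00]

# Code point order (same as UTF-8 byte order): U+FB01 < U+1F600.
code_point_order = sorted([a, b])

# Code unit order, as in the ECMAScript text quoted above: 0xD83D < 0xFB01.
utf16_order = sorted([a, b], key=utf16_code_units)

assert code_point_order == [a, b]
assert utf16_order == [b, a]
```

For strings that contain only characters below U+E000, or no supplementary characters at all, the two orderings are identical, which may be why the difference is easy to miss in testing.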

JSON Canonicalization Scheme (JCS) (RFC 8785) also sorts object property names by their UTF-16 code units: https://datatracker.ietf.org/doc/html/rfc8785#section-3.2.3 (although it specifies UTF-8 for the final serialization step, for cryptographic purposes: https://datatracker.ietf.org/doc/html/rfc8785#section-3.2.4)

It looks like Java strings use UTF-16 as well, although Java also provides functions based on Unicode code points; it's not clear to me whether string comparisons there use UTF-16 code units or Unicode code points...

If URDNA2015 is meant to sort in UTF-8 byte order (i.e. code point order), should the spec note this explicitly?
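
If code point order is what is intended, an implementation whose strings are UTF-16 code unit sequences would need to compensate when sorting. Below is a rough Python sketch of one known approach, a per-unit "fix-up" that moves surrogates above U+E000..U+FFFF before comparing. The function names and the sample quads are hypothetical, and the sketch assumes well-formed UTF-16 (no lone surrogates); it is not taken from the spec or from any existing implementation.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of 16-bit integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit):
    """Remap a code unit so that surrogates sort after U+E000..U+FFFF."""
    if unit >= 0xE000:
        return unit - 0x800   # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000  # D800..DFFF -> F800..FFFF
    return unit               # 0000..D7FF unchanged


def code_point_order_key(s):
    """Sort key that yields code point order from UTF-16 code units."""
    return [fixup(u) for u in utf16_code_units(s)]


# Hypothetical N-Quads that differ only in the literal value.
quads = [
    '<http://example.org/s> <http://example.org/p> "\U0001F600" .',
    '<http://example.org/s> <http://example.org/p> "\uFB01" .',
    '<http://example.org/s> <http://example.org/p> "a" .',
]

plain_utf16 = sorted(quads, key=utf16_code_units)
fixed_utf16 = sorted(quads, key=code_point_order_key)
by_utf8 = sorted(quads, key=lambda q: q.encode("utf-8"))

# The fix-up reproduces UTF-8 byte order; plain code unit order does not.
assert fixed_utf16 == by_utf8
assert plain_utf16 != by_utf8
```

In an environment that can encode strings to UTF-8 cheaply, sorting the encoded bytes directly would presumably be simpler than a fix-up like this.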

Also, FWIW, the Unicode Standard mentions lexicographic ordering in one place: https://www.unicode.org/versions/Unicode13.0.0/ch02.pdf

For these reasons, design of the Unicode Standard is not specific to the design of particular basic text-processing algorithms. Instead, it provides an encoding that can be used with a wide variety of algorithms. In particular, sorting and string comparison algorithms cannot assume that the assignment of Unicode character code numbers provides an alphabetical ordering for lexicographic string comparison. Culturally expected sorting orders require arbitrarily complex sorting algorithms. The expected sort sequence for the same characters differs across languages; thus, in general, no single acceptable lexicographic ordering exists. See Unicode Technical Standard #10, “Unicode Collation Algorithm,” for the standard default mechanism for comparing Unicode strings.

But I'm pretty sure we don't mean to sort using the Unicode Collation Algorithm, if only because its ordering is not fixed/stable: https://www.unicode.org/reports/tr10/#Common_Misperceptions