Lexicographic order of N-Quads
clehner opened this issue · comments
Lexicographic sorting is applied not only to hashes (#14) but also to N-Quads, for example in Hash First Degree Quads:
https://github.com/json-ld/rdf-dataset-canonicalization/blob/5661d1aa69532489290bc90e388f54608cd55465/spec/index.html#L644
N-Quads are specified as Unicode encoded as UTF-8:
- https://www.w3.org/TR/n-quads/#sec-mediatype
The content encoding of N-Quads is always UTF-8
- https://www.w3.org/TR/n-quads/#sec-grammar
An N-Quads document is a Unicode[UNICODE] character string encoded in UTF-8
Therefore I would assume that N-Quads should be sorted as UTF-8 or Unicode strings.
UTF-8 has the property that bytewise comparisons of UTF-8 strings are equivalent to codepoint comparisons of the corresponding string of Unicode codepoints, i.e. code point (character) order is the same as binary/byte order:
- https://en.wikipedia.org/wiki/UTF-8#Comparison_with_other_encodings
Sorting order: The chosen values of the leading bytes means that a list of UTF-8 strings can be sorted in code point order by sorting the corresponding byte sequences.
- https://stackoverflow.com/a/4611330
the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you will get the same result as if you first decoded the string into Unicode and compared the numeric values of each code point.
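To make the claim above concrete, here is a minimal Python sketch. It relies on the fact that Python 3 compares `str` values code point by code point; the sample strings are arbitrary and chosen only for illustration.

```python
# Sample strings mixing ASCII, a Latin-1 character, a BMP character near the
# top of the range (U+FF5E), and a supplementary-plane character (U+1F600).
samples = ["z", "\u00e9", "\uff5e", "\U0001f600", "a\U0001f600", "a\uff5e"]

# Python 3 compares str values code point by code point.
by_code_point = sorted(samples)

# Sort the same strings by their UTF-8 encoded bytes instead.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_utf8_bytes)  # True
```

Both orderings agree, so an implementation can sort either the decoded strings by code point or the raw UTF-8 bytes.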
However, string comparisons in ECMAScript/JavaScript are based not on UTF-8 or Unicode code points but on UTF-16 (or UCS-2) code units:
- https://mathiasbynens.be/notes/javascript-encoding
JavaScript engines are allowed to use either UCS-2 or UTF-16
[...] The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
- https://262.ecma-international.org/5.1/#sec-8.4
All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers
- https://262.ecma-international.org/5.1/#sec-4.3.16
Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text
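The practical consequence can be sketched in Python by emulating a comparison over 16-bit code units. The helper `utf16_key` is a hypothetical name introduced here; it assumes the input contains no lone surrogates, and it uses the fact that big-endian UTF-16 bytes compare in the same order as the underlying 16-bit units.

```python
def utf16_key(s: str) -> bytes:
    # Big-endian UTF-16 bytes sort the same way as the 16-bit code units,
    # which approximates how JavaScript compares strings.
    return s.encode("utf-16-be")

fullwidth_tilde = "\uff5e"    # U+FF5E, a single UTF-16 code unit: FF5E
grinning_face = "\U0001f600"  # U+1F600, a surrogate pair: D83D DE00

# Code point order: U+FF5E sorts before U+1F600.
by_code_point = sorted([grinning_face, fullwidth_tilde])

# UTF-16 code unit order: the lead surrogate D83D sorts before FF5E.
by_utf16_units = sorted([grinning_face, fullwidth_tilde], key=utf16_key)

print(by_code_point == [fullwidth_tilde, grinning_face])   # True
print(by_utf16_units == [grinning_face, fullwidth_tilde])  # True
```

The two orders only diverge when a character in the range U+E000 to U+FFFF is compared against a character outside the Basic Multilingual Plane, which is why the difference is easy to miss in testing.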
JSON Canonicalization Scheme (JCS) (RFC 8785) also encodes object properties as UTF-16 for sorting: https://datatracker.ietf.org/doc/html/rfc8785#section-3.2.3 (although it suggests using UTF-8 as a final serialization step for cryptographic purposes: https://datatracker.ietf.org/doc/html/rfc8785#section-3.2.4)
It looks like Java strings use UTF-16 as well, although Java also has functions based on Unicode code points; it's not clear to me whether string comparisons use UTF-16 code units or Unicode code points:
- https://docs.oracle.com/javase/8/docs/api/java/lang/String.html
- https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#compareTo-java.lang.String-
- https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints--
Should it be noted that URDNA2015 uses UTF-8 in its lexicographical sorting, if this is the case?
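To illustrate why the choice matters for canonicalization, here is a rough Python sketch of a sort-then-hash step. It is a simplified stand-in, not the specified algorithm: `hash_sorted_nquads` is a hypothetical helper, the two quads are made up, and SHA-256 is assumed as the hash function.

```python
import hashlib

def hash_sorted_nquads(nquads, key):
    # Sort the quads with the given key, join them, and hash the UTF-8 bytes.
    ordered = sorted(nquads, key=key)
    return hashlib.sha256("".join(ordered).encode("utf-8")).hexdigest()

# Two made-up quads that differ only in a literal: one contains U+1F600
# (outside the BMP), the other U+FF5E (inside the BMP, above the surrogates).
nquads = [
    '_:a <http://example.org/p> "\U0001f600" .\n',
    '_:a <http://example.org/p> "\uff5e" .\n',
]

# Code point order (equivalently, UTF-8 byte order).
code_point_hash = hash_sorted_nquads(nquads, key=lambda q: q.encode("utf-8"))

# UTF-16 code unit order, as a JavaScript-style comparison would produce.
utf16_hash = hash_sorted_nquads(nquads, key=lambda q: q.encode("utf-16-be"))

print(code_point_hash != utf16_hash)  # True
```

Two implementations that disagree on the sort order would therefore produce different hashes for the same input, so stating the intended order explicitly seems worthwhile.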
Also, FWIW, the Unicode Standard mentions lexicographic ordering in one place: https://www.unicode.org/versions/Unicode13.0.0/ch02.pdf
For these reasons, design of the Unicode Standard is not specific to the design of particular basic text-processing algorithms. Instead, it provides an encoding that can be used with a wide variety of algorithms. In particular, sorting and string comparison algorithms cannot assume that the assignment of Unicode character code numbers provides an alphabetical ordering for lexicographic string comparison. Culturally expected sorting orders require arbitrarily complex sorting algorithms. The expected sort sequence for the same characters differs across languages; thus, in general, no single acceptable lexicographic ordering exists. See Unicode Technical Standard #10, “Unicode Collation Algorithm,” for the standard default mechanism for comparing Unicode strings.
But I'm pretty sure we don't mean to sort using the Unicode Collation Algorithm, if only because it is not fixed/stable: https://www.unicode.org/reports/tr10/#Common_Misperceptions