nayuki / QR-Code-generator

High-quality QR Code generator library in Java, TypeScript/JavaScript, Python, Rust, C++, C.

Home Page: https://www.nayuki.io/page/qr-code-generator-library


UTF-8 encoding without ECI segment indicating deviation from standard encoding

YourMJK opened this issue

From what I could find online, it seems like the official QR code standard still specifies ISO-8859-1 (Latin-1) as the default text encoding for byte segments.
However, if I understand the code correctly, this library chose to encode text (that is not alphanumeric or numeric) in a byte segment using UTF-8 encoding without preceding it with a corresponding ECI segment that indicates that an encoding other than ISO-8859-1 was used.

From QrSegment.java:135:
result.add(makeBytes(text.toString().getBytes(StandardCharsets.UTF_8)));

What's the reasoning behind this?
I understand that a lot of readers nowadays use heuristics (which can fail) to guess the encoding of a byte segment anyway, but doesn't this behaviour make the produced QR codes technically non-compliant?
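
To make the concern concrete, here is a minimal sketch of what a strictly standard-following reader would display for such a symbol: the text is encoded as UTF-8 (as in the line quoted from QrSegment.java), and the bytes are then interpreted as ISO-8859-1. The class and method names are hypothetical and are not part of the library.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode as UTF-8 (what the generator writes into the byte segment),
    // then decode as ISO-8859-1 (the default a reader is supposed to
    // assume when no ECI segment precedes the data).
    static String decodeUtf8BytesAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Pure ASCII text is identical in both encodings.
        System.out.println(decodeUtf8BytesAsLatin1("abc"));
        // U+00E9 becomes the two bytes C3 A9, which read as two Latin-1 characters.
        System.out.println(decodeUtf8BytesAsLatin1("\u00E9"));
    }
}
```

ASCII-only text survives unchanged, so the difference only shows up once the text contains characters outside ASCII.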

Thanks in advance.

You are right. While reading up on ECI just now, I stumbled on this relevant article: https://www.linkedin.com/pulse/enhanced-channel-interpretation-terry-burton/

"In this example the application has read a non-ECI message, having symbology identifier "]Q1", and has performed an auto-detection of the character set choosing UTF-8. Recall that this is prohibited by the barcode standards which specify that all non-ECI data must be rendered using the default character set for the symbology, which is normally Latin-1."

The reasoning behind my code logic is that when I tried turning raw UTF-8 text into QR Codes and scanning them with Barcode Scanner for Android, I got the correct results. Your guess about heuristics (try decoding in UTF-8 first, then fall back to Latin-1) is probably correct.
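
The guessed heuristic could look roughly like the following sketch. This is an assumption about how a reader might behave, not the actual code of Barcode Scanner or any other app; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class HeuristicDecoder {
    // Try strict UTF-8 first; if the bytes are not well-formed UTF-8,
    // fall back to ISO-8859-1, which accepts every byte sequence.
    static String decode(byte[] data) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
        } catch (CharacterCodingException e) {
            return new String(data, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // Well-formed UTF-8 for U+00E9.
        System.out.println(decode(new byte[] {(byte) 0xC3, (byte) 0xA9}));
        // A lone E9 byte is malformed UTF-8, so the Latin-1 fallback is used.
        System.out.println(decode(new byte[] {(byte) 0xE9}));
    }
}
```

This also shows how such a heuristic can fail: Latin-1 text whose bytes happen to form valid UTF-8 (for example C3 A9) is silently read as UTF-8.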

Since this code logic "works" (in practice), I hesitate to change it and consume more bytes in the barcode. Also, the world has changed in the past 2 decades, and UTF-8 Everywhere is a reality. (Random examples: Rust str is UTF-8; JDK 18 introduces JEP 400: UTF-8 by Default.)

Thanks for your response and the very interesting references. I figured that this probably isn't a problem anymore today; you are right that UTF-8 is the standard almost everywhere (even where it shouldn't…).
I was just interested in whether this was a conscious decision or more of an "oversight". I also just recently found out about this while researching QR codes in the last week.

However, I still think it wouldn't hurt to add a high-level option to the text-based constructors for including an ECI segment before the UTF-8 encoded text, for people who want to stay as close to the specification as possible but don't know all the technical details.
At least this behaviour should be mentioned somewhere, in the code documentation or the website for example (sorry if it already is and I missed it).
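
For a sense of what such an option would add to the symbol, here is a rough, self-contained sketch of the bits of an ECI header as I understand the QR Code encoding rules: the 4-bit mode indicator 0111 followed by the assignment number, which is 26 for UTF-8. The class and method names are hypothetical; this is not the library's implementation.

```java
public class EciHeaderSketch {
    // ECI assignment number for UTF-8.
    static final int ECI_UTF8 = 26;

    // Returns the ECI header as a string of '0'/'1' characters:
    // mode indicator 0111, then the assignment number in 8, 16 or 24 bits.
    static String eciHeaderBits(int assignVal) {
        StringBuilder sb = new StringBuilder("0111");
        if (assignVal < 0) {
            throw new IllegalArgumentException("ECI assignment value out of range");
        } else if (assignVal < (1 << 7)) {
            appendBits(sb, assignVal, 8);   // 0bbbbbbb
        } else if (assignVal < (1 << 14)) {
            appendBits(sb, 0b10, 2);        // 10 + 14 bits
            appendBits(sb, assignVal, 14);
        } else if (assignVal < 1_000_000) {
            appendBits(sb, 0b110, 3);       // 110 + 21 bits
            appendBits(sb, assignVal, 21);
        } else {
            throw new IllegalArgumentException("ECI assignment value out of range");
        }
        return sb.toString();
    }

    // Appends the low 'len' bits of 'val', most significant bit first.
    static void appendBits(StringBuilder sb, int val, int len) {
        for (int i = len - 1; i >= 0; i--) {
            sb.append((val >>> i) & 1);
        }
    }

    public static void main(String[] args) {
        String bits = eciHeaderBits(ECI_UTF8);
        System.out.println(bits + " (" + bits.length() + " bits)");
    }
}
```

Under these assumptions the UTF-8 designator costs 12 extra bits ahead of the byte segment, which is the overhead being weighed against strict compliance.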

Thank you for this library. I'm using a Swift translation of it in a project of mine.

If I recall correctly, I blindly implemented UTF-8 text bytes at first and it just worked with ZXing's Barcode Scanner app. A year later, I implemented ECI segments, and not long after I realized that UTF-8 wasn't actually the default character encoding when no ECI is specified.

Your suggestion to automatically include a UTF-8 ECI is not a bad one, and it's true that I didn't document the non-ECI behavior anywhere. But I'll put this topic on hold as I'm not enthusiastic about changing my library in this way.

Okay, I understand.
I hope you'll include a short sentence about this on the website; this could be very relevant information for some people.

FYI: I just checked it and not even iOS's built-in QR code reader (15.5 and 14.8) can properly recognize Latin-1 encoded text unless all characters map to the same bytes as UTF-8 encoding! It only decodes Latin-1 correctly if the ECI segment with designator 3 (for Latin-1 encoding) is prepended.

So perhaps you are right and UTF-8 without any ECI is even better supported than the actual standard.
Long story short: it's probably fine the way it is and adding the ECI may even reduce compatibility.

Thank you for the experiments and additional information!