decodeUtf8Lenient emits too many replacement characters (at least more than python and rust)
HugoPeters1024 opened this issue · comments
Summary
I encountered this discrepancy whilst working on a custom decodeUtf8Lenient function as a workaround for haskell/bytestring#575.
decodeUtf8Lenient is a total function: when it encounters invalid bytes, it replaces them with U+FFFD (�). It will, however, replace every byte with its own replacement character until decoding succeeds again. This behavior is slightly different from Python and Rust.
Reproducer
The byte array [0xf0, 0x90, 0x28, 0xbc], which contains the valid '(' character (0x28):
-- Haskell
-- Data.ByteString.pack takes a [Word8], so the bytes are given as a list
decodeUtf8Lenient (Data.ByteString.pack [0xf0, 0x90, 0x28, 0xbc]) == "��(�"
# Python 3.10.9
b"\xf0\x90\x28\xbc".decode('utf8', errors='replace') == '�(�'
// Rust
// using the crate utf-8 v0.7.6 (https://docs.rs/utf-8/latest/utf8/)
// the interface is a bit unusual: each decoded chunk is passed to the callback, which prints it via dbg!
use utf8::LossyDecoder;
let mut decoder = LossyDecoder::new(|x| { dbg!(x); });
decoder.feed(&[0xF0, 0x90, 0x28, 0xbc]);
prints:
[src/main.rs:4] x = ""
[src/main.rs:4] x = "�"
[src/main.rs:4] x = "("
[src/main.rs:4] x = "�"
[src/main.rs:4] x = ""
Hypothesis about the cause
I believe the Python and Rust implementations emit U+FFFD once when encountering an invalid byte, and then skip all continuation bytes (of the form 10xx_xxxx) before continuing. The current text implementation seems to pop just the one invalid byte and then continue parsing; it then encounters an erroneous continuation byte, emits another replacement character, and so on.
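To make the two strategies in the hypothesis concrete, here is a minimal Python sketch. It only models the hypothesis; it is not how CPython, the utf-8 crate, or text are actually implemented, and it may disagree with them on other inputs. The names `try_decode_one`, `is_continuation` and `decode_lenient` are made up for this illustration.

```python
REPLACEMENT = "\ufffd"


def try_decode_one(data: bytes, i: int):
    """Return (char, length) if a well-formed UTF-8 sequence starts at index i, else None."""
    for length in (1, 2, 3, 4):
        chunk = data[i:i + length]
        if len(chunk) < length:
            break  # ran out of input
        try:
            return chunk.decode("utf-8"), length
        except UnicodeDecodeError:
            continue  # too short or malformed; try a longer candidate
    return None


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xx_xxxx."""
    return (byte & 0xC0) == 0x80


def decode_lenient(data: bytes, skip_continuations: bool) -> str:
    """Hypothetical lenient decoder illustrating the two replacement strategies."""
    out = []
    i = 0
    while i < len(data):
        hit = try_decode_one(data, i)
        if hit is not None:
            char, length = hit
            out.append(char)
            i += length
            continue
        # Invalid byte: emit one replacement character and drop that byte.
        out.append(REPLACEMENT)
        i += 1
        if skip_continuations:
            # Hypothesised Python/Rust behaviour: also drop trailing continuation bytes.
            while i < len(data) and is_continuation(data[i]):
                i += 1
    return "".join(out)


sample = bytes([0xF0, 0x90, 0x28, 0xBC])
print(decode_lenient(sample, skip_continuations=False))  # ��(�
print(decode_lenient(sample, skip_continuations=True))   # �(�
```

On the reproducer bytes, the per-byte variant gives the Haskell result and the skipping variant gives the Python result quoted above.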
Meta discussion about correctness
I don't think that it is well specified what the correct behavior is. I am only making the argument that it is preferable to match other languages.
This is a breaking change and I'm not convinced that it's justified. I mean, malformed UTF-8 is an exceptional case and any sort of lenient decoding is kinda best possible effort for various definitions of "best".
The new interface for incremental decoding allows you to implement any desirable strategy.
Loud and clear! Thanks for your input, closing