haskell / text

Haskell library for space- and time-efficient operations over Unicode text.

Home Page: http://hackage.haskell.org/package/text

decodeUtf8Lenient emits too many replacement characters (at least more than python and rust)

HugoPeters1024 opened this issue

Summary

I encountered this discrepancy whilst working on a custom decodeUtf8Lenient function as a workaround for haskell/bytestring#575.

decodeUtf8Lenient is a total function: when it encounters invalid bytes, it replaces them with U+FFFD (�). However, it keeps replacing every byte with the replacement character until it can decode successfully again. This behavior differs slightly from python and rust.

Reproducer

The byte array [0xf0, 0x90, 0x28, 0xbc], which contains the valid '(' character (0x28)

-- Haskell 
decodeUtf8Lenient (Data.ByteString.pack "\xf0\x90\x28\xbc") == "��(�" 
# Python 3.10.9
b"\xf0\x90\x28\xbc".decode('utf8', errors='replace') == '�(�'
// Rust
// using the crate utf-8 v0.7.6 (https://docs.rs/utf-8/latest/utf8/)
// the interface is a bit weird, every byte is printed to stdout
let mut decoder = LossyDecoder::new(|x| { dbg!(x); });
decoder.feed(&[0xF0,0x90,0x28,0xbc]);

prints:

  [src/main.rs:4] x = ""
  [src/main.rs:4] x = "�"
  [src/main.rs:4] x = "("
  [src/main.rs:4] x = "�"
  [src/main.rs:4] x = ""

Hypothesis about the cause

I believe the python and rust implementations emit U+FFFD once when encountering an invalid byte, and then skip all continuation bytes (of the form 10xx_xxxx) before continuing. The current text implementation seems to pop just the one invalid byte and continue parsing, encounter an erroneous continuation byte, emit another replacement character, and so on.
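The hypothesized strategy can be sketched in Python (a minimal sketch of the "one U+FFFD, then skip continuation bytes" idea described above, not the actual text or CPython implementation; the function name and the use of bytes.decode for per-sequence validation are my own choices, and it is not byte-for-byte identical to python's replacement policy on every malformed input):

```python
REPLACEMENT = "\ufffd"

def is_continuation(b: int) -> bool:
    # continuation bytes have the bit pattern 10xxxxxx
    return b & 0xC0 == 0x80

def decode_skip_continuations(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:            # ASCII byte: always valid on its own
            out.append(chr(b))
            i += 1
            continue
        # number of continuation bytes a valid sequence with this lead needs
        if   0xC2 <= b <= 0xDF: need = 1
        elif 0xE0 <= b <= 0xEF: need = 2
        elif 0xF0 <= b <= 0xF4: need = 3
        else:                   need = -1   # invalid lead byte
        decoded = None
        if need > 0:
            seq = data[i : i + 1 + need]
            if len(seq) == 1 + need and all(is_continuation(c) for c in seq[1:]):
                try:
                    # also rejects overlong and surrogate encodings
                    decoded = seq.decode("utf-8")
                except UnicodeDecodeError:
                    decoded = None
        if decoded is not None:
            out.append(decoded)
            i += 1 + need
        else:
            # emit a single U+FFFD for the bad byte, then skip any
            # continuation bytes that immediately follow it
            out.append(REPLACEMENT)
            i += 1
            while i < len(data) and is_continuation(data[i]):
                i += 1
    return "".join(out)
```

Running it on the reproducer bytes b"\xf0\x90\x28\xbc" yields "�(�", matching the python and rust output shown above rather than the four replacement characters produced by text.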

Meta discussion about correctness

I don't think the correct behavior is well specified. I am only arguing that it is preferable to match other languages.

This is a breaking change and I'm not convinced that it's justified. I mean, malformed UTF-8 is an exceptional case and any sort of lenient decoding is kinda best possible effort for various definitions of "best".

The new interface for incremental decoding allows you to implement any desirable strategy.

Loud and clear! Thanks for your input, closing