Preprocessor breaks on non-UTF8 input

Question

qwertystop opened this issue 7 years ago

The preprocessor script fails to run when the input file is not valid UTF8. Given that the algorithms involved, to the best of my understanding, ought to work on any ordered collection of bytes, this seems like a bug.
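To illustrate the failure mode described above, here is a minimal Python sketch. It is not the project's preprocessor: the helper name `try_decode_utf8` and the sample bytes are assumptions made for the example. It shows that strict UTF-8 decoding rejects some byte sequences, while the same input is still a perfectly usable ordered sequence of integers.

```python
# Sample input that is NOT valid UTF-8: the byte 0xff never occurs in UTF-8.
data = b"abc\xff\xfe"


def try_decode_utf8(raw: bytes):
    """Return the decoded string, or None if raw is not valid UTF-8.

    Hypothetical helper for this example only.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


decoded = try_decode_utf8(data)  # strict decoding fails on 0xff
as_ints = list(data)             # the raw bytes are still an ordered sequence

print(decoded)
print(as_ints)
```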

antihutka · Answer 1 · Sun Apr 02 2017 14:41:07 GMT+0800 (China Standard Time)

Use --encoding bytes to switch to byte encoding. This isn't properly supported in the sampling script, but that's easy to fix.
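As a rough picture of what byte-level encoding means in general, the sketch below maps each distinct byte value to a token index and back. This is an assumption-laden illustration, not the actual behavior of the script's `--encoding bytes` option: the function names and the 1-based indexing are hypothetical choices made for the example.

```python
def build_byte_vocab(raw: bytes) -> dict:
    """Map each distinct byte value in raw to a 1-based token index."""
    return {b: i + 1 for i, b in enumerate(sorted(set(raw)))}


def encode_bytes(raw: bytes, vocab: dict) -> list:
    """Turn raw bytes into a list of token indices."""
    return [vocab[b] for b in raw]


def decode_tokens(tokens: list, vocab: dict) -> bytes:
    """Invert encode_bytes, recovering the original bytes exactly."""
    inverse = {idx: b for b, idx in vocab.items()}
    return bytes(inverse[t] for t in tokens)


# Input mixing ASCII with bytes that are not valid UTF-8.
raw = b"caf\xe9 \xff\x00"
vocab = build_byte_vocab(raw)
tokens = encode_bytes(raw, vocab)
restored = decode_tokens(tokens, vocab)
```

Because no text decoding step is involved, any input file round-trips unchanged. A sampling script would need the matching inverse step (here `decode_tokens`) and would have to write its output as raw bytes rather than text.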

Qwertystop · Answer 2 · Sun Apr 02 2017 21:57:05 GMT+0800 (China Standard Time)

Well, that'll teach me to file issues at two in the morning. Thanks. Not closing this yet; I have an idea to make the script work for multi-byte "characters" of no particular encoding. Also, I can try to put together a PR for the sampling script to support bytes.

Qwertystop · Answer 3 · Wed Apr 26 2017 23:50:39 GMT+0800 (China Standard Time)

Never mind my prior comment. Closing this; any such improvement would be a PR.