Preprocessor breaks on non-UTF8 input
qwertystop opened this issue
The preprocessor script fails to run when the input file is not valid UTF-8. Given that the algorithms involved, to the best of my understanding, ought to work on any ordered collection of bytes, this seems like a bug.
Use --encoding bytes to switch to byte encoding. This isn't properly supported in the sampling script, but that's easy to fix.
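For anyone wondering why the two modes behave differently, here is a rough sketch of the idea, not the actual preprocessor code. The function name `build_vocab` and its return shape are made up for illustration; the only assumption is that a byte mode treats each byte as a token instead of decoding the file to text first.

```python
def build_vocab(data: bytes, encoding: str = "utf-8"):
    """Hypothetical helper: map each distinct token to an integer id (from 1)."""
    if encoding == "bytes":
        # Iterating over a bytes object yields ints in 0..255, so any input works.
        tokens = list(data)
    else:
        # Decoding raises UnicodeDecodeError when the input is not valid text
        # in the requested encoding, which is the failure reported above.
        tokens = list(data.decode(encoding))

    token_to_idx = {}
    for tok in tokens:
        if tok not in token_to_idx:
            token_to_idx[tok] = len(token_to_idx) + 1

    # Return the vocabulary plus the input rewritten as a list of ids.
    return token_to_idx, [token_to_idx[t] for t in tokens]


if __name__ == "__main__":
    raw = b"\xff\xfeab\xff"  # 0xFF can never appear in valid UTF-8
    vocab, ids = build_vocab(raw, encoding="bytes")
    print(vocab, ids)
    try:
        build_vocab(raw, encoding="utf-8")
    except UnicodeDecodeError as err:
        print("utf-8 mode failed:", err)
```

In byte mode the vocabulary can never exceed 256 entries, and a sampling script would need to map ids back to raw bytes rather than to characters.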
Well, that'll teach me to file issues at two in the morning. Thanks. Not closing this yet; I have an idea to make the script work for multi-byte "characters" of no particular encoding. Also, I can try to put together a PR for the sampling script to support bytes.
Never mind my prior comment; closing this, since any such improvement would be a PR.