jcjohnson / torch-rnn

Efficient, reusable RNNs and LSTMs for torch

Preprocessor breaks on non-UTF8 input

qwertystop opened this issue

The preprocessor script fails to run when the input file is not valid UTF-8. Since the algorithms involved should, to the best of my understanding, work on any ordered sequence of bytes, this seems like a bug.

Use --encoding bytes to switch to byte encoding. This isn't properly supported in the sampling script, but that's easy to fix.
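
For anyone landing here later, here is a minimal sketch of why byte mode sidesteps the failure. It is not torch-rnn's actual code: the function names `encode` and `decode_bytes`, the sample input, and the choice to start indices at 1 are all assumptions made for illustration. The idea is that decoding as UTF-8 can raise on arbitrary input, while treating each byte as its own token always works, and the sampling side only needs the inverse mapping to turn indices back into bytes.

```python
def encode(data: bytes, encoding=None):
    """Build a token -> index vocabulary and encode `data` with it.

    encoding=None treats every byte as a token (byte mode); otherwise the
    data is decoded first and every character is a token.
    Hypothetical helper, not part of torch-rnn.
    """
    if encoding is None:
        tokens = list(data)                   # ints in 0..255
    else:
        tokens = list(data.decode(encoding))  # may raise UnicodeDecodeError
    token_to_idx = {}
    for tok in tokens:
        if tok not in token_to_idx:
            # Assumption: indices start at 1.
            token_to_idx[tok] = len(token_to_idx) + 1
    return token_to_idx, [token_to_idx[tok] for tok in tokens]


def decode_bytes(token_to_idx, indices):
    """Invert a byte-mode vocabulary and rebuild the original bytes."""
    idx_to_token = {i: t for t, i in token_to_idx.items()}
    return bytes(idx_to_token[i] for i in indices)


# 0xE9 is "é" in Latin-1 but is not valid UTF-8 in this position.
raw = b"caf\xe9 ok"

try:
    encode(raw, encoding="utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

vocab, indices = encode(raw)           # byte mode accepts any input
round_trip = decode_bytes(vocab, indices)
```

In this sketch the UTF-8 path fails on the sample input while the byte path round-trips it unchanged. The trade-off is that multi-byte characters get split across several tokens.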

Well, that'll teach me to file issues at two in the morning. Thanks. Not closing this yet; I have an idea to make the script work for multi-byte "characters" of no particular encoding. Also, I can try to put together a PR for the sampling script to support bytes.

Never mind my prior comment. Closing this, since any such improvement would be a PR.