hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

OpusCleaner leaves `\r` in output?

jelmervdl opened this issue · comments

commented

Nick ran into some issues with spurious `\r` carriage return characters in the output files. That should not happen.

Bifixer should be able to fix that kind of thing.

I don't run everything through bifixer as allegedly some of the corpora are clean. Allegedly.

Bifixer can be run on everything as it doesn't discard sentences and can fix many small things.

commented

In this case I'd say it is a bug in OpusCleaner, because I've written code that reads like `for line in sys.stdin: do_something(line.rstrip('\n'))`, to be a bit cautious not to strip any empty `\t` from the end of a line. But this also leaves the `\r` in place on datasets that use Windows line endings. There's also code in there that just assumes `line[:-1]` will strip all the newline characters.
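To make the failure mode concrete, here is a minimal sketch in plain Python. The helper `strip_eol` is a hypothetical name used for illustration, not a function from OpusCleaner; it removes exactly one trailing line ending and leaves trailing tabs alone.

```python
def strip_eol(line: str) -> str:
    # Hypothetical helper: drop one trailing line ending (CRLF, LF or CR)
    # and nothing else, so trailing tabs (empty columns) survive.
    if line.endswith("\r\n"):
        return line[:-2]
    if line.endswith(("\n", "\r")):
        return line[:-1]
    return line


# A tab-separated line with an empty last column and a Windows line ending.
windows_line = "hello\tworld\t\r\n"

only_lf_stripped = windows_line.rstrip("\n")   # the '\r' is still there
sliced = windows_line[:-1]                     # same problem: only '\n' removed
whitespace_stripped = windows_line.rstrip()    # removes '\r\n' but also the tab
safe = strip_eol(windows_line)                 # keeps the tab, drops '\r\n'

print(repr(only_lf_stripped))
print(repr(sliced))
print(repr(whitespace_stripped))
print(repr(safe))
```

A shorter alternative is `line.rstrip('\r\n')`, which strips any run of `\r` and `\n` characters at the end of the line while still leaving tabs in place.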

Edit: in case anyone wants to attempt to exploit some unexpected behaviour, `splitlines()` splits on a lot of stuff, which is fun if it is used for, say, splitting the headers of an HTTP response or an email. You can easily inject your own headers 😎
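A small illustration of that `splitlines()` behaviour, using made-up sample data: `str.splitlines()` treats characters such as form feed (`\x0c`) and the Unicode line separator (`\u2028`) as line boundaries, so a record containing them is split into several lines, whereas splitting on `\n` alone keeps it whole.

```python
# Two newline-terminated records; the second contains a Unicode line
# separator and a form feed inside its fields (made-up sample data).
text = "en\tde\nfoo\u2028bar\tbaz\x0cqux\n"

# Splitting on '\n' only: two records, as intended.
by_newline = text.rstrip("\n").split("\n")

# splitlines() also breaks on '\u2028' and '\x0c': four "lines".
by_splitlines = text.splitlines()

print(len(by_newline), by_newline)
print(len(by_splitlines), by_splitlines)
```

For line-oriented corpus files this suggests splitting on `\n` explicitly and handling a trailing `\r` separately, rather than relying on `splitlines()`.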