hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

FileNotFoundError: [WinError 2] The system cannot find the file specified

tomsbergmanis opened this issue

Hi Jelmer and others,
After Nick's speech, I tried out OpusCleaner on Windows 11. The corpora downloaded successfully, but I was unable to filter them: the view that should have shown a sample of sentences was empty.

As I couldn't figure out what was wrong, I have attached error.log.

I would appreciate some help to get past this as the tool looks amazing and I would like to introduce it at our company!
Thanks in advance,
Toms

The user is running on Windows and we fail to find gzip. How sad...
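For anyone hitting the same error: Python's `subprocess` raises `FileNotFoundError` (shown on Windows as `[WinError 2]`) when the executable it is asked to start cannot be found. A minimal sketch for checking this up front is below. The helper name `find_missing_tools` is hypothetical and not part of OpusCleaner, and `gzip` is the only tool named in this thread; other tools may be needed as well.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH.

    Hypothetical helper for illustration; not part of OpusCleaner.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # "gzip" is the executable mentioned in this issue.
    missing = find_missing_tools(["gzip"])
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All required executables found.")
```

If `gzip` is reported as missing, that would be consistent with the empty sample view described above.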

Supporting Windows has not been a focus so far.

I made the choice to use Unix pipes for basically all of the actual processing, i.e. to run gzip as a stand-alone process instead of trying to write performant Python loops.
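To illustrate the trade-off, here is a rough sketch of reading a few lines from a gzipped file through an external `gzip` process, with a fallback to Python's built-in `gzip` module when no executable is on the PATH. This is not how OpusCleaner is implemented; the function name `read_gzip_lines` and the fallback itself are assumptions made for the example.

```python
import gzip
import shutil
import subprocess


def read_gzip_lines(path, limit=5):
    """Return up to `limit` text lines from the gzip file at `path`.

    Hypothetical helper: prefers an external `gzip` process (the pipe
    approach) and falls back to Python's gzip module when no `gzip`
    executable is found, as on a stock Windows install.
    """
    exe = shutil.which("gzip")
    if exe is None:
        # Pure-Python fallback: slower on large files, but portable.
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            return [line.rstrip("\r\n") for _, line in zip(range(limit), fh)]

    # Pipe approach: let gzip decompress to stdout and read from the pipe.
    proc = subprocess.Popen([exe, "-cd", path], stdout=subprocess.PIPE)
    lines = []
    try:
        for raw in proc.stdout:
            if len(lines) >= limit:
                break
            lines.append(raw.decode("utf-8").rstrip("\r\n"))
    finally:
        # Closing our end of the pipe lets gzip exit if we stopped early.
        proc.stdout.close()
        proc.wait()
    return lines
```

Calling `read_gzip_lines("corpus.tsv.gz", limit=3)` (the file name is a placeholder) would return the first three lines either way; only the speed on large corpora should differ.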

Just to get an idea of your use case:

  • Would you run OpusCleaner just to create the cleaning configurations, or are you also cleaning your actual datasets on Windows?
  • Would you be okay with using it in the Windows Subsystem for Linux (WSL)?
  • Or as a Docker container?

I would be happy with any Windows-compatible solution, as that would allow us to use it on our company's desktop machines, which, unfortunately, run Windows. I will try Windows Subsystem for Linux!
Cheers!