hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

How to use opuscleaner-clean with stdin?

eu9ene opened this issue

I'm trying something like this, and it doesn't do anything:

paste <(pigz -dc data/train-parts/ELRC-3075-wikipedia_health-v1.en-ru.en.gz) \
      <(pigz -dc data/train-parts/ELRC-3075-wikipedia_health-v1.en-ru.ru.gz) \
| opuscleaner-clean --input=- data/train-parts/ELRC-3075-wikipedia_health-v1.en-ru.filters.json en ru

Am I doing it wrong?

This works:

opuscleaner-clean data/train-parts/ELRC-3075-wikipedia_health-v1.en-ru.filters.json en ru

It was an indentation error: the code that runs the pipeline sat inside the else branch, which only triggers when you pass no --input, so with --input=- nothing ran 🤦
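
For readers who want to see how this kind of bug looks, here is a minimal sketch of the pattern described above. It is not the actual OpusCleaner source: `run_pipeline`, `main_buggy`, `main_fixed` and the placeholder default input are hypothetical names made up for illustration, and the "pipeline" merely counts lines.

```python
import argparse
import io
import sys


def run_pipeline(stream):
    """Hypothetical stand-in for the real pipeline: count the input lines."""
    return sum(1 for _ in stream)


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # Assumption for this sketch: "-" means "read from stdin".
    parser.add_argument("--input", default=None)
    return parser.parse_args(argv)


def main_buggy(argv, stdin=sys.stdin):
    args = parse_args(argv)
    if args.input is not None:
        stream = stdin if args.input == "-" else open(args.input, encoding="utf-8")
    else:
        # Placeholder for "use the dataset named in the filters file".
        stream = io.StringIO("line from the default dataset\n")
        # Bug: this call is indented into the else branch,
        # so it only runs when --input is NOT given.
        return run_pipeline(stream)
    # With --input, control falls through here and nothing happens.


def main_fixed(argv, stdin=sys.stdin):
    args = parse_args(argv)
    if args.input is not None:
        stream = stdin if args.input == "-" else open(args.input, encoding="utf-8")
    else:
        stream = io.StringIO("line from the default dataset\n")
    # Fix: dedented, so the pipeline runs for every input source.
    return run_pipeline(stream)


if __name__ == "__main__":
    fake_stdin = io.StringIO("hello\tпривет\nworld\tмир\n")
    print(main_buggy(["--input=-"], stdin=fake_stdin))
    fake_stdin.seek(0)
    print(main_fixed(["--input=-"], stdin=fake_stdin))
```

In the buggy variant the stdin case selects a stream and then silently returns without processing it, which matches the "doesn't do anything" symptom in the question. Moving the call out one indentation level is the whole fix.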