hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

opuscleaner-clean returns status code 0 if one of the processes fails

eu9ene opened this issue

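I ran into this with the deescape_tsv filter and the Books-v1.en-ru dataset. Step 1 exits with status code 1 and the pipeline thread raises an exception, but the overall command still exits with 0: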

 opuscleaner-clean --parallel 4 --batch-size=50000 test_data/clean.en-ru.filters.json > test
[run.py] Waiting for splitter to finish
[run.py] gunzip test_data/ELRC-3075-wikipedia_health-v1.en-ru.en.gz exited with status code 0
[run.py] gunzip test_data/ELRC-3075-wikipedia_health-v1.en-ru.ru.gz exited with status code 0
[run.py] paste exited with status code 0
[run.py] Wrote 4073 lines to batch 0: /var/folders/0k/_blk67rn5m5dhbcc263rfp5w0000gn/T/tmpyago8d6c
[run.py] Waiting for pipelines to finish
[run.py] Filtering chunk /var/folders/0k/_blk67rn5m5dhbcc263rfp5w0000gn/T/tmpyago8d6c to /var/folders/0k/_blk67rn5m5dhbcc263rfp5w0000gn/T/tmp520fzno3
[run.py] step 0: Started LANG1=en; /Users/epavlov/opt/anaconda3/envs/opuscleaner/bin/python3.9 /Users/epavlov/opt/anaconda3/envs/opuscleaner/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] step 1: Started ./deescape_tsv.py
[run.py] step 2: Started LANG1=other; /Users/epavlov/opt/anaconda3/envs/opuscleaner/bin/python3.9 /Users/epavlov/opt/anaconda3/envs/opuscleaner/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] step 3: Started grep -vE '^\s*\t|\t\s*$'
[run.py] step 4: Started ./fix_quotes.py
[run.py] step 5: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
[run.py] step 6: Started PATTERN_FILE=/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] step 7: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
[run.py] step 8: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] step 9: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] step 10: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
Waiting for 12 subprocesses to finish...
[run.py] step 11: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] step 0 exited with status code 0
[step 1] Traceback (most recent call last):
[step 1]   File "/Users/epavlov/opt/anaconda3/envs/opuscleaner/lib/python3.9/site-packages/opuscleaner/filters/./deescape_tsv.py", line 9, in <module>
[step 1]     if field[0] == QUOTECHR and field[-1] == QUOTECHR:
[step 1] IndexError: index out of range
[run.py] step 1 exited with status code 1
[run.py] step 2 exited with status code 0
[run.py] step 3 exited with status code 0
[run.py] step 4 exited with status code 0
[run.py] step 5 exited with status code 0
[run.py] step 6 exited with status code 0
[run.py] step 7 exited with status code 0
[run.py] step 8 exited with status code 0
[run.py] step 9 exited with status code 0
[run.py] step 10 exited with status code 0
[run.py] step 11 exited with status code 0
Exception in thread Thread-9:
Traceback (most recent call last):
  File "/Users/epavlov/opt/anaconda3/envs/opuscleaner/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/Users/epavlov/opt/anaconda3/envs/opuscleaner/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/epavlov/opt/anaconda3/envs/opuscleaner/lib/python3.9/site-packages/opuscleaner/clean.py", line 331, in run_pipeline
    pipeline.run(pool, stdin, stdout)
  File "/Users/epavlov/opt/anaconda3/envs/opuscleaner/lib/python3.9/site-packages/opuscleaner/clean.py", line 180, in __exit__
    raise Exception(f"Child {(child_i + 1)} {self.children[child_i].name} exited with {retval}")
Exception: Child 2 step 1 exited with 1
Waiting for 3 subprocesses to finish...
[run.py] Waiting for merger to finish
(opuscleaner) danmer-macbook:bergamot-training epavlov$ echo $?
0
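
For illustration, here is a minimal sketch of how this kind of thing can happen and one way around it. It assumes, based on the traceback above, that each pipeline runs in a `threading.Thread`. An uncaught exception in a worker thread is printed but does not change the exit status of the main thread, so the process can still end with 0. The names `run_all`, `exit_code`, `ok_step` and `failing_step` are hypothetical and are not part of OpusCleaner:

```python
import threading


def run_all(tasks):
    """Run each callable in its own thread and return the exceptions raised."""
    errors = []
    lock = threading.Lock()

    def wrapper(task):
        try:
            task()
        except Exception as exc:
            # Record the failure so the main thread can see it after join().
            with lock:
                errors.append(exc)

    threads = [threading.Thread(target=wrapper, args=(task,)) for task in tasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


def exit_code(errors):
    """Map collected worker errors to a process exit status."""
    return 1 if errors else 0


def ok_step():
    """Hypothetical pipeline step that succeeds."""


def failing_step():
    """Hypothetical pipeline step that fails like step 1 in the log."""
    raise RuntimeError("Child 2 step 1 exited with 1")


# Typical use in a command-line entry point (not executed here):
#     import sys
#     sys.exit(exit_code(run_all([ok_step, failing_step])))
```

With something along these lines, `echo $?` would print 1 whenever any worker failed. Whether this fits the structure of `clean.py` is a guess on my side.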

Also, the fact that this filter fails on the data at all is a separate issue.
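
My guess is that `field[0]` is evaluated on an empty field, which would explain the `IndexError` on line 9 of `deescape_tsv.py`. Below is a sketch of a guard that avoids indexing into an empty field. Only the `if` condition comes from the traceback. The value of `QUOTECHR`, the helper names `strip_outer_quotes` and `process_line`, and the simplified "strip the outer quotes" behaviour are assumptions, and the real filter may do more than this:

```python
QUOTECHR = '"'  # assumed value; the real constant in deescape_tsv.py may differ


def strip_outer_quotes(field: str) -> str:
    """Remove one pair of surrounding quote characters, if present."""
    # Check the length first so an empty field never reaches field[0].
    if len(field) >= 2 and field[0] == QUOTECHR and field[-1] == QUOTECHR:
        return field[1:-1]
    return field


def process_line(line: str) -> str:
    """Apply strip_outer_quotes to every tab-separated field of a line."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(strip_outer_quotes(field) for field in fields)
```

A line with an empty column, such as `'"a"\t\n'`, then passes through as `'a\t'` instead of raising.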