opuscleaner-clean hangs on CCMatrix

Question

eu9ene opened this issue 10 months ago

There are some errors, but the run doesn't stop; instead, it appears to hang. Even if the config is incorrect, I would expect it to exit. The jobs for other datasets finished successfully, and I don't see any errors in their logs.
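
To make the expected behavior concrete, here is a minimal sketch of a fail-fast filter chain. It is not how OpusCleaner's run.py is implemented: `run_pipeline` is a hypothetical helper that runs the steps one after another (not in parallel batches) and stops as soon as any step exits with a non-zero status.

```python
import subprocess


def run_pipeline(commands, data: bytes) -> bytes:
    """Feed `data` through each command in turn; stop on the first failure.

    `commands` is a list of argv lists. This is an illustration only:
    it buffers all data in memory and runs the steps sequentially.
    """
    for argv in commands:
        result = subprocess.run(argv, input=data, stdout=subprocess.PIPE)
        if result.returncode != 0:
            # Abort instead of waiting on the remaining steps.
            raise RuntimeError(
                f"step {argv!r} exited with status code {result.returncode}"
            )
        data = result.stdout
    return data
```

With a chain like this, a filter that exits with status code 1 ends the whole run with an error instead of leaving it waiting.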

OpusCleaner version is the latest main (git+https://github.com/hplt-project/OpusCleaner.git@3e258ea369c790b4e0697048f237179286b46e61)
Dataset: opus_CCMatrix/v1
Lang pair: en-ru
Cleaning config:


{
  "version": 1,
  "files": [],
  "filters": [
    {
      "filter": "deescape-special-chars",
      "parameters": {
        "LANG1": "en"
      },
      "language": "en"
    },
    {
      "filter": "deescape-special-chars",
      "parameters": {
        "LANG1": "other"
      },
      "language": "ru"
    },
    {
      "filter": "remove_empty_lines",
      "parameters": {},
      "language": null
    },
    {
      "filter": "fix_quotes",
      "parameters": {},
      "language": null
    },
    {
      "filter": "num_mismatch",
      "parameters": {
        "RATIO": 1,
        "DEBUG": false
      },
      "language": null
    },
    {
      "filter": "remove_frequent_patterns",
      "parameters": {
        "PATTERN_FILE": "/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt"
      },
      "language": null
    },
    {
      "filter": "src_trg_ratio",
      "parameters": {
        "RATIO": 0.6,
        "LOG": true
      },
      "language": null
    },
    {
      "filter": "max_word_length",
      "parameters": {
        "MAXWORDLENGTH": 150
      },
      "language": null
    },
    {
      "filter": "max_length",
      "parameters": {
        "MAXLENGTH": 150,
        "MINLENGTH": 1
      },
      "language": null
    },
    {
      "filter": "alpha_ratio",
      "parameters": {
        "LANG1": "en",
        "LANG2": "ru",
        "SRCWORDRAT": 0.4,
        "TRGWORDRAT": 0.4,
        "SRCALPHARAT": 0.5,
        "TRGALPHARAT": 0.5,
        "DEBUG": false
      },
      "language": null
    },
    {
      "filter": "fasttext_filter",
      "parameters": {
        "FASTTEXT_MODEL_TYPE": "large",
        "LANG1": "en",
        "LANG2": "ru"
      },
      "language": null
    }
  ]
}

Log:

...
[run.py] step 47/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] step 47/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] step 47/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
Waiting for 11 subprocesses to finish...
[run.py] step 47/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] Wrote 50000 lines to batch 152: /tmp/tmp9lj7sbca
[run.py] 0/0/deescape-special-chars exited with status code 0
[run.py] Wrote 50000 lines to batch 153: /tmp/tmp3ntpameo
[run.py] 0/1/deescape-special-chars exited with status code 0
[run.py] 0/2/remove_empty_lines exited with status code 0
[run.py] Wrote 50000 lines to batch 154: /tmp/tmprnd3kbh2
[run.py] 0/3/fix_quotes exited with status code 0
[run.py] 0/4/num_mismatch exited with status code 0
[run.py] 0/5/remove_frequent_patterns exited with status code 0
[run.py] 0/6/src_trg_ratio exited with status code 0
[run.py] Wrote 50000 lines to batch 155: /tmp/tmp5ryevhp7
[run.py] 0/7/max_word_length exited with status code 0
[run.py] 0/8/max_length exited with status code 0
[run.py] 0/9/alpha_ratio exited with status code 0
[run.py] Wrote 50000 lines to batch 156: /tmp/tmpj6j6jrgc
[run.py] 0/10/fasttext_filter exited with status code 0
[run.py] Filtering chunk /tmp/tmpdb4m2d9e to /tmp/tmpdfa1r7u3
[run.py] step 0/0/deescape-special-chars: Started LANG1=en; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] step 0/1/deescape-special-chars: Started LANG1=other; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] step 0/2/remove_empty_lines: Started grep -vE '^\s*\t|\t\s*$'
[run.py] step 0/3/fix_quotes: Started ./fix_quotes.py
[run.py] step 0/4/num_mismatch: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
[run.py] step 0/5/remove_frequent_patterns: Started PATTERN_FILE=/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] step 0/6/src_trg_ratio: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
[run.py] step 0/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] Wrote 50000 lines to batch 157: /tmp/tmpmiag6s7a
[run.py] step 0/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] step 0/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
Waiting for 11 subprocesses to finish...
[run.py] step 0/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] 10/0/deescape-special-chars exited with status code 0
[run.py] Wrote 50000 lines to batch 158: /tmp/tmp_pk85gbs
[run.py] 10/1/deescape-special-chars exited with status code 0
[run.py] 10/2/remove_empty_lines exited with status code 0
[run.py] 10/3/fix_quotes exited with status code 0
[run.py] 10/4/num_mismatch exited with status code 0
[run.py] Wrote 50000 lines to batch 159: /tmp/tmpa_mkpkqm
[run.py] 10/5/remove_frequent_patterns exited with status code 0
[run.py] 10/6/src_trg_ratio exited with status code 0
[run.py] 10/7/max_word_length exited with status code 0
[run.py] 10/8/max_length exited with status code 0
[run.py] 10/9/alpha_ratio exited with status code 0
[run.py] Wrote 50000 lines to batch 160: /tmp/tmpo5bk6wnd
[run.py] 10/10/fasttext_filter exited with status code 0
[run.py] Filtering chunk /tmp/tmpbhr2qkr7 to /tmp/tmp5ni3em2o
[run.py] step 10/0/deescape-special-chars: Started LANG1=en; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] step 10/1/deescape-special-chars: Started LANG1=other; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] step 10/2/remove_empty_lines: Started grep -vE '^\s*\t|\t\s*$'
[run.py] step 10/3/fix_quotes: Started ./fix_quotes.py
[run.py] step 10/4/num_mismatch: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
[run.py] step 10/5/remove_frequent_patterns: Started PATTERN_FILE=/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] step 10/6/src_trg_ratio: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
[run.py] Wrote 50000 lines to batch 161: /tmp/tmpze3nrbxe
[run.py] step 10/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] step 10/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] step 10/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
Waiting for 11 subprocesses to finish...
[run.py] step 10/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] 25/0/deescape-special-chars exited with status code 0
[run.py] Wrote 50000 lines to batch 162: /tmp/tmp9wh3czrs
[run.py] 18/0/deescape-special-chars exited with status code 0
[run.py] 25/1/deescape-special-chars exited with status code 0
[run.py] Wrote 50000 lines to batch 163: /tmp/tmpcmwxwen2
[run.py] 25/2/remove_empty_lines exited with status code 0
[run.py] 25/3/fix_quotes exited with status code 0
[run.py] 18/1/deescape-special-chars exited with status code 0
[run.py] 25/4/num_mismatch exited with status code 0
[run.py] 18/2/remove_empty_lines exited with status code 0
[run.py] 25/5/remove_frequent_patterns exited with status code 0
[run.py] 25/6/src_trg_ratio exited with status code 0
[run.py] Wrote 50000 lines to batch 164: /tmp/tmpeoog0yso
[run.py] 18/3/fix_quotes exited with status code 0
[run.py] 25/7/max_word_length exited with status code 0
[run.py] 18/4/num_mismatch exited with status code 0
[run.py] 18/5/remove_frequent_patterns exited with status code 0
[run.py] 25/8/max_length exited with status code 0
[run.py] 18/6/src_trg_ratio exited with status code 0
[run.py] Wrote 50000 lines to batch 165: /tmp/tmp7o32om6x
[run.py] 25/9/alpha_ratio exited with status code 0
[run.py] 18/7/max_word_length exited with status code 0
[run.py] 18/8/max_length exited with status code 0
[run.py] 18/9/alpha_ratio exited with status code 0
[run.py] 25/10/fasttext_filter exited with status code 0
[run.py] Filtering chunk /tmp/tmpik4rka5e to /tmp/tmp7yb2q_9t
[run.py] Wrote 50000 lines to batch 166: /tmp/tmprsp3zwz5
[run.py] step 25/0/deescape-special-chars: Started LANG1=en; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] step 25/1/deescape-special-chars: Started LANG1=other; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] 18/10/fasttext_filter exited with status code 0
[run.py] Filtering chunk /tmp/tmp3wxjc4sn to /tmp/tmpk2o9jinx
[run.py] step 25/2/remove_empty_lines: Started grep -vE '^\s*\t|\t\s*$'
[run.py] step 25/3/fix_quotes: Started ./fix_quotes.py
[run.py] step 18/0/deescape-special-chars: Started LANG1=en; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] step 25/4/num_mismatch: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
[run.py] step 18/1/deescape-special-chars: Started LANG1=other; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] step 25/5/remove_frequent_patterns: Started PATTERN_FILE=/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] step 18/2/remove_empty_lines: Started grep -vE '^\s*\t|\t\s*$'
[run.py] step 25/6/src_trg_ratio: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
[run.py] step 18/3/fix_quotes: Started ./fix_quotes.py
[run.py] step 25/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] step 18/4/num_mismatch: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
[run.py] step 25/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] step 18/5/remove_frequent_patterns: Started PATTERN_FILE=/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] step 25/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
[run.py] step 18/6/src_trg_ratio: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
Waiting for 11 subprocesses to finish...
[run.py] step 25/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] step 18/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] step 18/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] step 18/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
[run.py] 30/0/deescape-special-chars exited with status code 0
Waiting for 11 subprocesses to finish...
[run.py] step 18/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] Wrote 50000 lines to batch 167: /tmp/tmpmutcejmo
[run.py] 30/1/deescape-special-chars exited with status code 0
[run.py] 30/2/remove_empty_lines exited with status code 0
[run.py] 30/3/fix_quotes exited with status code 0
[run.py] 30/4/num_mismatch exited with status code 0
[run.py] 30/5/remove_frequent_patterns exited with status code 0
[run.py] Wrote 50000 lines to batch 168: /tmp/tmpjdpn7oyg
[run.py] 30/6/src_trg_ratio exited with status code 0
[run.py] 30/7/max_word_length exited with status code 0
[run.py] 30/8/max_length exited with status code 0
[run.py] 30/9/alpha_ratio exited with status code 0
[run.py] Wrote 50000 lines to batch 169: /tmp/tmpv71vjweo
[run.py] 30/10/fasttext_filter exited with status code 0
[run.py] Filtering chunk /tmp/tmp4icox4zo to /tmp/tmp0re6aiy1
[run.py] step 30/0/deescape-special-chars: Started LANG1=en; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] 27/0/deescape-special-chars exited with status code 0
[run.py] step 30/1/deescape-special-chars: Started LANG1=other; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] step 30/2/remove_empty_lines: Started grep -vE '^\s*\t|\t\s*$'
[run.py] step 30/3/fix_quotes: Started ./fix_quotes.py
[run.py] step 30/4/num_mismatch: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
[run.py] step 30/5/remove_frequent_patterns: Started PATTERN_FILE=/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] step 30/6/src_trg_ratio: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
[run.py] step 30/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] step 30/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] 27/1/deescape-special-chars exited with status code 0
[run.py] step 30/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
Waiting for 11 subprocesses to finish...
[run.py] step 30/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] Wrote 50000 lines to batch 170: /tmp/tmpwkbzwyif
[run.py] 27/2/remove_empty_lines exited with status code 0
[run.py] 27/3/fix_quotes exited with status code 0
[run.py] 27/4/num_mismatch exited with status code 0
[run.py] 27/5/remove_frequent_patterns exited with status code 0
[run.py] Wrote 50000 lines to batch 171: /tmp/tmpehtr2ldp
[run.py] 27/6/src_trg_ratio exited with status code 0
[run.py] 27/7/max_word_length exited with status code 0
[run.py] 27/8/max_length exited with status code 0
[run.py] 27/9/alpha_ratio exited with status code 0
[run.py] Wrote 50000 lines to batch 172: /tmp/tmp3k2v0y1_
[run.py] 27/10/fasttext_filter exited with status code 0
[run.py] Filtering chunk /tmp/tmpdqtq73zu to /tmp/tmpajt85ytd
[run.py] step 27/0/deescape-special-chars: Started LANG1=en; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] step 27/1/deescape-special-chars: Started LANG1=other; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] step 27/2/remove_empty_lines: Started grep -vE '^\s*\t|\t\s*$'
[run.py] step 27/3/fix_quotes: Started ./fix_quotes.py
[run.py] step 27/4/num_mismatch: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
[run.py] step 27/5/remove_frequent_patterns: Started PATTERN_FILE=/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] step 27/6/src_trg_ratio: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
[run.py] step 27/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] step 27/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] step 27/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
Waiting for 11 subprocesses to finish...
[run.py] step 27/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] Wrote 50000 lines to batch 173: /tmp/tmp3mm2ycbu
[run.py] Wrote 50000 lines to batch 174: /tmp/tmpl6k1g3md
[run.py] Wrote 50000 lines to batch 175: /tmp/tmp0l_xqzeq
[run.py] Wrote 50000 lines to batch 176: /tmp/tmpfdhqz7q0
[run.py] Wrote 50000 lines to batch 177: /tmp/tmppif2btyh
[run.py] Wrote 50000 lines to batch 178: /tmp/tmphvz5dg_0
[run.py] Wrote 50000 lines to batch 179: /tmp/tmpm4v856xu
[run.py] Wrote 50000 lines to batch 180: /tmp/tmpmdw4w_07
[run.py] Wrote 50000 lines to batch 181: /tmp/tmpxjz1owpk
[run.py] 40/0/deescape-special-chars exited with status code 0
[run.py] Wrote 50000 lines to batch 182: /tmp/tmpmap562_7
[run.py] 40/1/deescape-special-chars exited with status code 0
[run.py] 40/2/remove_empty_lines exited with status code 0
[run.py] 40/3/fix_quotes exited with status code 0
[run.py] Wrote 50000 lines to batch 183: /tmp/tmp8f8gxxfw
[run.py] 40/4/num_mismatch exited with status code 0
[run.py] 40/5/remove_frequent_patterns exited with status code 0
[run.py] 40/6/src_trg_ratio exited with status code 0
[run.py] 40/7/max_word_length exited with status code 0
[run.py] 44/0/deescape-special-chars exited with status code 0
[run.py] 40/8/max_length exited with status code 0
[run.py] Wrote 50000 lines to batch 184: /tmp/tmpnsecakhm
[run.py] 40/9/alpha_ratio exited with status code 0
[run.py] 40/10/fasttext_filter exited with status code 0
[run.py] Filtering chunk /tmp/tmp275mspvt to /tmp/tmp4jo9nnb4
[run.py] step 40/0/deescape-special-chars: Started LANG1=en; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] 44/1/deescape-special-chars exited with status code 0
[run.py] step 40/1/deescape-special-chars: Started LANG1=other; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] step 40/2/remove_empty_lines: Started grep -vE '^\s*\t|\t\s*$'
[run.py] 44/2/remove_empty_lines exited with status code 0
[run.py] step 40/3/fix_quotes: Started ./fix_quotes.py
[run.py] Wrote 50000 lines to batch 185: /tmp/tmp1wx7f1bw
[run.py] step 40/4/num_mismatch: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
[run.py] 44/3/fix_quotes exited with status code 0
[run.py] step 40/5/remove_frequent_patterns: Started PATTERN_FILE=/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] step 40/6/src_trg_ratio: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
[run.py] 44/4/num_mismatch exited with status code 0
[run.py] step 40/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] step 40/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] step 40/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
[run.py] 44/5/remove_frequent_patterns exited with status code 0
Waiting for 11 subprocesses to finish...
[run.py] step 40/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] 44/6/src_trg_ratio exited with status code 0
[run.py] 44/7/max_word_length exited with status code 0
[run.py] Wrote 50000 lines to batch 186: /tmp/tmphj98jrf2
[run.py] 44/8/max_length exited with status code 0
[run.py] 44/9/alpha_ratio exited with status code 0
[run.py] 44/10/fasttext_filter exited with status code 0
[run.py] Filtering chunk /tmp/tmph59wvj7l to /tmp/tmp0afaj05a
[run.py] step 44/0/deescape-special-chars: Started LANG1=en; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] step 44/1/deescape-special-chars: Started LANG1=other; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] step 44/2/remove_empty_lines: Started grep -vE '^\s*\t|\t\s*$'
[run.py] step 44/3/fix_quotes: Started ./fix_quotes.py
[run.py] step 44/4/num_mismatch: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
[run.py] step 44/5/remove_frequent_patterns: Started PATTERN_FILE=/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] step 44/6/src_trg_ratio: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
[run.py] step 44/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] step 44/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] step 44/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
[run.py] Wrote 50000 lines to batch 187: /tmp/tmpb39mlm8x
Waiting for 11 subprocesses to finish...
[run.py] step 44/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] Wrote 50000 lines to batch 188: /tmp/tmp6qf23orn
[run.py] Wrote 50000 lines to batch 189: /tmp/tmpioopj72y
[run.py] Wrote 50000 lines to batch 190: /tmp/tmp44jntqg0
[run.py] Wrote 50000 lines to batch 191: /tmp/tmpanbcsmzt
[run.py] Wrote 50000 lines to batch 192: /tmp/tmp7n76xk7y
[run.py] 10/0/deescape-special-chars exited with status code 0
[run.py] 10/1/deescape-special-chars exited with status code 0
[run.py] 10/2/remove_empty_lines exited with status code 0
[10/4/num_mismatch] Traceback (most recent call last):
[10/4/num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 42, in <module>
[10/4/num_mismatch]     filter_numerical_mismatch(sys.stdin, sys.stdout, args.ratio, debug=args.debug)
[10/4/num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 16, in filter_numerical_mismatch
[10/4/num_mismatch]     assert len(cols) >= 2
[10/4/num_mismatch] AssertionError
[run.py] 10/3/fix_quotes exited with status code 0
[run.py] 10/4/num_mismatch exited with status code 1
[run.py] 10/5/remove_frequent_patterns exited with status code 0
[run.py] 10/6/src_trg_ratio exited with status code 0
[run.py] 10/7/max_word_length exited with status code 0
[run.py] 10/8/max_length exited with status code 0
[run.py] 10/9/alpha_ratio exited with status code 0
[run.py] Wrote 50000 lines to batch 193: /tmp/tmpx82vg91t
[run.py] 10/10/fasttext_filter exited with status code 0
[run.py] Wrote 50000 lines to batch 194: /tmp/tmp_gwcl2dm
[run.py] Wrote 50000 lines to batch 195: /tmp/tmpvoasw3a4
[run.py] Wrote 50000 lines to batch 196: /tmp/tmpeasb9gqp
[run.py] Wrote 50000 lines to batch 197: /tmp/tmph3f6gnm7
[run.py] Wrote 50000 lines to batch 198: /tmp/tmpk31kn2oo
[run.py] Wrote 50000 lines to batch 199: /tmp/tmplaplpupr
[run.py] Wrote 50000 lines to batch 200: /tmp/tmpf343ddda
[run.py] 30/0/deescape-special-chars exited with status code 0
[30/4/num_mismatch] Traceback (most recent call last):
[30/4/num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 42, in <module>
[30/4/num_mismatch]     filter_numerical_mismatch(sys.stdin, sys.stdout, args.ratio, debug=args.debug)
[30/4/num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 16, in filter_numerical_mismatch
[30/4/num_mismatch]     assert len(cols) >= 2
[30/4/num_mismatch] AssertionError
[run.py] 30/2/remove_empty_lines exited with status code 0
[run.py] 30/1/deescape-special-chars exited with status code 0
[run.py] 30/3/fix_quotes exited with status code 0
[run.py] 30/4/num_mismatch exited with status code 1
[run.py] 30/5/remove_frequent_patterns exited with status code 0
[run.py] 30/6/src_trg_ratio exited with status code 0
[run.py] 30/7/max_word_length exited with status code 0
[run.py] 30/8/max_length exited with status code 0
[run.py] 30/9/alpha_ratio exited with status code 0
[run.py] Wrote 50000 lines to batch 201: /tmp/tmpqt23u_09
[run.py] 30/10/fasttext_filter exited with status code 0
[run.py] 2/0/deescape-special-chars exited with status code 0
[run.py] Wrote 50000 lines to batch 202: /tmp/tmpg5u8cz0x
[run.py] 2/1/deescape-special-chars exited with status code 0
[run.py] 2/2/remove_empty_lines exited with status code 0
[run.py] 47/0/deescape-special-chars exited with status code 0
[run.py] 2/3/fix_quotes exited with status code 0
[run.py] Wrote 50000 lines to batch 203: /tmp/tmpwoksewbr
[run.py] 2/4/num_mismatch exited with status code 0
[run.py] 2/5/remove_frequent_patterns exited with status code 0
[run.py] 2/6/src_trg_ratio exited with status code 0
[run.py] 47/1/deescape-special-chars exited with status code 0
[run.py] 47/2/remove_empty_lines exited with status code 0
[run.py] 2/7/max_word_length exited with status code 0
[run.py] 47/3/fix_quotes exited with status code 0
[run.py] 2/8/max_length exited with status code 0
[run.py] Wrote 50000 lines to batch 204: /tmp/tmp0okt78wo
[run.py] 47/4/num_mismatch exited with status code 0
[run.py] 2/9/alpha_ratio exited with status code 0
[run.py] 47/5/remove_frequent_patterns exited with status code 0
[run.py] 47/6/src_trg_ratio exited with status code 0
[run.py] 47/7/max_word_length exited with status code 0
[run.py] 47/8/max_length exited with status code 0
[run.py] 2/10/fasttext_filter exited with status code 0
[run.py] 47/9/alpha_ratio exited with status code 0
[run.py] Filtering chunk /tmp/tmpkticru69 to /tmp/tmpknn8pb53
[run.py] Wrote 50000 lines to batch 205: /tmp/tmppeazfky_
[run.py] step 2/0/deescape-special-chars: Started LANG1=en; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] step 2/1/deescape-special-chars: Started LANG1=other; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] step 2/2/remove_empty_lines: Started grep -vE '^\s*\t|\t\s*$'
[run.py] step 2/3/fix_quotes: Started ./fix_quotes.py
[run.py] 47/10/fasttext_filter exited with status code 0
[run.py] step 2/4/num_mismatch: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
[run.py] Filtering chunk /tmp/tmpbrjks70z to /tmp/tmpucxb_kje
[run.py] step 2/5/remove_frequent_patterns: Started PATTERN_FILE=/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] 0/0/deescape-special-chars exited with status code 0
[0/4/num_mismatch] Traceback (most recent call last):
[0/4/num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 42, in <module>
[0/4/num_mismatch]     filter_numerical_mismatch(sys.stdin, sys.stdout, args.ratio, debug=args.debug)
[0/4/num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 16, in filter_numerical_mismatch
[0/4/num_mismatch]     assert len(cols) >= 2
[0/4/num_mismatch] AssertionError
[run.py] 0/2/remove_empty_lines exited with status code 0
[run.py] step 47/0/deescape-special-chars: Started LANG1=en; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 0 ./deescape-special-chars.perl -l $LANG1
[run.py] 0/1/deescape-special-chars exited with status code 0
[run.py] 0/3/fix_quotes exited with status code 0
[run.py] step 2/6/src_trg_ratio: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
[run.py] 0/4/num_mismatch exited with status code 1
[run.py] step 47/1/deescape-special-chars: Started LANG1=other; /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/bin/python /data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/col.py 1 ./deescape-special-chars.perl -l $LANG1
[run.py] step 2/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] 0/5/remove_frequent_patterns exited with status code 0
[run.py] step 47/2/remove_empty_lines: Started grep -vE '^\s*\t|\t\s*$'
[run.py] step 2/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] step 47/3/fix_quotes: Started ./fix_quotes.py
[run.py] step 2/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
[run.py] 0/6/src_trg_ratio exited with status code 0
[run.py] step 47/4/num_mismatch: Started RATIO=1.0; DEBUG=''; ./num_mismatch.py ${DEBUG:+--debug} --ratio $RATIO
Waiting for 11 subprocesses to finish...
[run.py] step 2/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] step 47/5/remove_frequent_patterns: Started PATTERN_FILE=/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt; ./remove_frequent_patterns.py --pattern-file $PATTERN_FILE ${DEBUG:+--debug}
[run.py] 0/7/max_word_length exited with status code 0
[run.py] step 47/6/src_trg_ratio: Started RATIO=0.6; LOG=1; ./src_trg_ratio.py ${LOG:+--log} --ratio-length $RATIO
[run.py] step 47/7/max_word_length: Started MAXWORDLENGTH=150; ./max_word_length.py --max-word-length $MAXWORDLENGTH
[run.py] step 47/8/max_length: Started MAXLENGTH=150; MINLENGTH=1; ./max_length.py --max-length $MAXLENGTH --min-length $MINLENGTH
[run.py] 0/8/max_length exited with status code 0
[run.py] step 47/9/alpha_ratio: Started LANG1=en; LANG2=ru; SRCWORDRAT=0.4; TRGWORDRAT=0.4; SRCALPHARAT=0.5; TRGALPHARAT=0.5; DEBUG=''; ./alpha_ratio.py --src-lang $LANG1 ${LANG2:+--trg-lang $LANG2} --ratio-words-src $SRCWORDRAT --ratio-words-trg $TRGWORDRAT --ratio-alpha-src $SRCALPHARAT --ratio-alpha-trg $TRGALPHARAT ${DEBUG:+--debug}
Waiting for 11 subprocesses to finish...
[run.py] step 47/10/fasttext_filter: Started FASTTEXT_MODEL_TYPE=large; LANG1=en; LANG2=ru; ./fasttext_filter.py --source-lang $LANG1 --target-lang $LANG2 --model-type $FASTTEXT_MODEL_TYPE ${DEBUG:+--debug}
[run.py] 0/9/alpha_ratio exited with status code 0
[run.py] Wrote 50000 lines to batch 206: /tmp/tmpplfwao91
[run.py] 0/10/fasttext_filter exited with status code 0
Waiting for 0 subprocesses to finish...
[run.py] Wrote 50000 lines to batch 207: /tmp/tmpr1mdp8it
Traceback (most recent call last):
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/clean.py", line 530, in main
    run_parallel(pipeline, stdin, stdout, print_queue=print_queue, parallel=args.parallel, batch_size=args.batch_size)
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/clean.py", line 421, in run_parallel
    runner.join()
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/_util.py", line 28, in join
    raise self.exception
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/_util.py", line 21, in run
    super().run()
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/clean.py", line 335, in run_pipeline
    pipeline.run(pool, stdin, stdout)
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/e4e5bcc3631188b470fcef56b9de8b3e/lib/python3.9/site-packages/opuscleaner/clean.py", line 183, in __exit__
    raise Exception(f"Child {(child_i + 1)} {self.children[child_i].name} exited with {retval}")
Exception: Child 5 0/4/num_mismatch exited with 1
Waiting for 11 subprocesses to finish...
Waiting for 11 subprocesses to finish...
Waiting for 11 subprocesses to finish...
Waiting for 11 subprocesses to finish...

Jelmer · Answer 1 · Mon Sep 18 2023 17:54:48 GMT+0800 (China Standard Time)

With the parallel option, the exception raised by the pipeline.run() call (caused by num_mismatch crashing) is propagated, but that doesn't stop the other sections from processing. The run then hangs in the merger, which is still expecting a chunk from the failed pipeline.run() call.
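To illustrate the failure mode described above, here is a minimal sketch of a worker/merger setup in which a crashed worker still reports back, so the merger never waits on a chunk that will not arrive. This is not the code in clean.py; `process_chunk`, `merge`, and the `_FAILED` sentinel are hypothetical names, and the column check only stands in for the num_mismatch assertion.

```python
import queue
import threading

_FAILED = object()  # hypothetical sentinel marking a chunk whose pipeline crashed


def process_chunk(chunk, results):
    """Worker: run a stand-in 'pipeline' on one chunk and always report back."""
    try:
        for line in chunk:
            # Stand-in for the column-count assertion in num_mismatch.
            if len(line.split("\t")) < 2:
                raise ValueError(f"expected at least 2 columns: {line!r}")
        results.put(chunk)
    except Exception as exc:
        # Without this put(), the merger below would block forever on get().
        results.put((_FAILED, exc))


def merge(chunks):
    """Merger: collect one result per chunk, stopping at the first failure."""
    results = queue.Queue()
    workers = [
        threading.Thread(target=process_chunk, args=(chunk, results))
        for chunk in chunks
    ]
    for worker in workers:
        worker.start()

    merged = []
    error = None
    for _ in workers:
        item = results.get()
        if isinstance(item, tuple) and item[0] is _FAILED:
            error = item[1]
            break
        merged.extend(item)

    # Every worker puts exactly one item, so joining cannot deadlock.
    for worker in workers:
        worker.join()
    if error is not None:
        raise RuntimeError("pipeline failed") from error
    return merged
```

Note that this sketch merges chunks in completion order rather than input order; a real merger would also need to keep track of chunk positions.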

I'll add a proper fix for that and some test cases. (I've been loving the test cases in OpusTrainer, so expect a bunch of them here as well)

The crash in num_mismatch is a separate issue. Going by the error message, it looks like it received a line that doesn't have at least two columns. I'm not sure what to do about it. @eu9ene (and @XapaJIaMnu, @graemenail), what would you rather have: an invalid line in the output (i.e. num_mismatch silently ignoring the line), the line being filtered out (i.e. num_mismatch silently dropping it), or keeping the crash and fixing it by adding a filter at the front that drops lines that don't have two columns? The last option has my personal preference: it would filter out shitty data, but still cause crashes when any of the filters start introducing shitty data (which would clearly be a bug that I'd rather not miss).

Graeme Nail · Answer 2 · Mon Sep 18 2023 18:09:58 GMT+0800 (China Standard Time)

Agree with your point, @jelmervdl, about having some "basic consistency" filter at the front that enforces things like requiring two columns when cleaning bilingual data.

As you say, filters introducing ill-formed data is an implementation issue, and should throw an error. Data that is already ill-formed should be cleaned away.

Nikolay Bogoychev · Answer 3 · Mon Sep 18 2023 18:32:19 GMT+0800 (China Standard Time)

In the fork of OpusCleaner that I use, I have added a remove-empty-lines step to some of the filters because of this precise issue. Some filters introduce empty lines when they run, so even if we have supposedly removed empty lines in an earlier filter, they can inadvertently be reintroduced. It then becomes a bit of a whack-a-mole issue, so I wonder whether it should be solved at the pipeline level, or whether we just add boilerplate to each individual filter.
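As a rough illustration of the pipeline-level option, the sketch below wraps any filter so that its output is checked for the expected column count, instead of repeating that check inside each filter. It assumes filters can be modelled as Python callables that take and return iterables of lines, which is a simplification: OpusCleaner runs filters as subprocesses. The name `checked` is hypothetical.

```python
def checked(filter_fn, name, min_cols=2):
    """Wrap a lines-to-lines filter so ill-formed output raises immediately."""

    def wrapper(lines):
        for line in filter_fn(lines):
            # Count tab-separated columns, ignoring the trailing newline.
            if len(line.rstrip("\r\n").split("\t")) < min_cols:
                raise ValueError(
                    f"filter {name!r} emitted a line with fewer than "
                    f"{min_cols} columns: {line!r}"
                )
            yield line

    return wrapper
```

With a wrapper like this, a filter that starts emitting empty or single-column lines fails loudly at the step that introduced them, rather than crashing a later filter.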

Evgeny Pavlov · Answer 4 · Wed Sep 20 2023 05:04:05 GMT+0800 (China Standard Time)

"fix it by adding a filter at the front that drops lines that don't have two columns"

This solution would identify the issue explicitly and allow us to see how many lines were filtered.
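A minimal sketch of what such a front filter could look like, assuming tab-separated input with one sentence pair per line. `filter_columns` is a hypothetical name and not an existing OpusCleaner filter; it returns the number of dropped lines so the count can be logged.

```python
def filter_columns(fin, fout, min_cols=2):
    """Copy lines with at least min_cols tab-separated columns from fin to fout.

    Returns the number of lines that were dropped.
    """
    dropped = 0
    for line in fin:
        if len(line.rstrip("\r\n").split("\t")) >= min_cols:
            fout.write(line)
        else:
            dropped += 1
    return dropped


# Possible usage as a stdin/stdout filter (left as comments in this sketch):
#   import sys
#   n = filter_columns(sys.stdin, sys.stdout)
#   print(f"dropped {n} lines with fewer than 2 columns", file=sys.stderr)
```

Writing the count to stderr keeps it out of the data stream while still making it visible in the job log.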

Jelmer · Answer 5 · Tue Oct 03 2023 22:07:39 GMT+0800 (China Standard Time)

I've rewritten quite a bit of the error handling logic in clean.py so that it quits earlier when an error is noticed. I've also added more tests for error detection.

Evgeny Pavlov · Answer 6 · Wed Oct 04 2023 01:00:03 GMT+0800 (China Standard Time)

@jelmervdl thank you! I can confirm that it doesn't hang anymore. The num_mismatch issue is still present for me, so I'll file a separate bug.