hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

num_mismatch fails on CCMatrix

eu9ene opened this issue

Cleaning completes successfully on all other datasets and fails only on CCMatrix.

We started discussing potential solutions here. I think that if the original dataset is correct and we still see this issue, the cleaner should be able to handle it automatically. Otherwise, if the solution is to add an extra filter, we'll get random failures for some language pairs and datasets that require manual intervention every time.

OpusCleaner version: 90a27f1

log:

opuscleaner-clean --parallel 48 --batch-size=50000 --input=- /data/rw/evgeny/data/en-ru/opuscleaner_default_filters/clean/corpus/opus_CCMatrix/v1.en-ru.filters.json en ru
+ cut -f2
+ tee /dev/fd/63
+ pigz
+ paste /dev/fd/63 /dev/fd/62
++ pigz -dc /data/rw/evgeny/data/en-ru/opuscleaner_default_filters/original/corpus/opus_CCMatrix/v1.en.gz
++ pigz -dc /data/rw/evgeny/data/en-ru/opuscleaner_default_filters/original/corpus/opus_CCMatrix/v1.ru.gz
++ cut -f1
++ pigz
[4/4:num_mismatch] Traceback (most recent call last):
[4/4:num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 42, in <module>
[4/4:num_mismatch]     filter_numerical_mismatch(sys.stdin, sys.stdout, args.ratio, debug=args.debug)
[4/4:num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 16, in filter_numerical_mismatch
[4/4:num_mismatch]     assert len(cols) >= 2
[4/4:num_mismatch] AssertionError
[2/4:num_mismatch] Traceback (most recent call last):
[2/4:num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 42, in <module>
[2/4:num_mismatch]     filter_numerical_mismatch(sys.stdin, sys.stdout, args.ratio, debug=args.debug)
[2/4:num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 16, in filter_numerical_mismatch
[2/4:num_mismatch]     assert len(cols) >= 2
[2/4:num_mismatch] AssertionError
[5/4:num_mismatch] Traceback (most recent call last):
[5/4:num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 42, in <module>
[5/4:num_mismatch]     filter_numerical_mismatch(sys.stdin, sys.stdout, args.ratio, debug=args.debug)
[5/4:num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 16, in filter_numerical_mismatch
[5/4:num_mismatch]     assert len(cols) >= 2
[5/4:num_mismatch] AssertionError
[8/4:num_mismatch] Traceback (most recent call last):
[8/4:num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 42, in <module>
[8/4:num_mismatch]     filter_numerical_mismatch(sys.stdin, sys.stdout, args.ratio, debug=args.debug)
[8/4:num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 16, in filter_numerical_mismatch
[8/4:num_mismatch]     assert len(cols) >= 2
[8/4:num_mismatch] AssertionError
[7/4:num_mismatch] Traceback (most recent call last):
[7/4:num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 42, in <module>
[7/4:num_mismatch]     filter_numerical_mismatch(sys.stdin, sys.stdout, args.ratio, debug=args.debug)
[7/4:num_mismatch]   File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/filters/./num_mismatch.py", line 16, in filter_numerical_mismatch
[7/4:num_mismatch]     assert len(cols) >= 2
[7/4:num_mismatch] AssertionError
Traceback (most recent call last):
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/clean.py", line 414, in run_pipeline
    pipeline.run(pool, stdin, stdout, time=time)
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/logging.py", line 270, in __exit__
    super().__exit__(typ, value, traceback)
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/clean.py", line 235, in __exit__
    raise RuntimeError(f"Child {problem_child.name} (pid {problem_child.process.pid}) exited with {problem_child.process.returncode}")
RuntimeError: Child 4/4:num_mismatch (pid 460) exited with 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/clean.py", line 625, in main
    run_parallel(pipeline, stdin, stdout, print_queue=print_queue, parallel=args.parallel, batch_size=args.batch_size, time=args.time)
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/logging.py", line 250, in wrapper
    return fn(*args, **kwargs)
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/clean.py", line 503, in run_parallel
    pool.join()
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/_util.py", line 53, in join
    raise exc
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/_util.py", line 17, in _thread_pool_worker
    target(*args, **kwargs)
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/logging.py", line 250, in wrapper
    return fn(*args, **kwargs)
  File "/data/rw/evgeny/bergamot-training-opuscleaner/.snakemake/conda/bd12234156942d85998204723869b2dc/lib/python3.9/site-packages/opuscleaner/clean.py", line 426, in run_pipeline
    raise RuntimeError(f'Error while processing batch {batch_index}') from exc
RuntimeError: Error while processing batch 4

config:

{
  "version": 1,
  "files": [],
  "filters": [
    {
      "filter": "deescape-special-chars",
      "parameters": {
        "LANG1": "en"
      },
      "language": "en"
    },
    {
      "filter": "deescape-special-chars",
      "parameters": {
        "LANG1": "other"
      },
      "language": "ru"
    },
    {
      "filter": "remove_empty_lines",
      "parameters": {},
      "language": null
    },
    {
      "filter": "fix_quotes",
      "parameters": {},
      "language": null
    },
    {
      "filter": "num_mismatch",
      "parameters": {
        "RATIO": 1,
        "DEBUG": false
      },
      "language": null
    },
    {
      "filter": "remove_frequent_patterns",
      "parameters": {
        "PATTERN_FILE": "/data/rw/evgeny/bergamot-training-opuscleaner/pipeline/clean/opuscleaner/configs/remove_frequent_patterns.txt"
      },
      "language": null
    },
    {
      "filter": "src_trg_ratio",
      "parameters": {
        "RATIO": 0.6,
        "LOG": true
      },
      "language": null
    },
    {
      "filter": "max_word_length",
      "parameters": {
        "MAXWORDLENGTH": 150
      },
      "language": null
    },
    {
      "filter": "max_length",
      "parameters": {
        "MAXLENGTH": 150,
        "MINLENGTH": 1
      },
      "language": null
    },
    {
      "filter": "alpha_ratio",
      "parameters": {
        "LANG1": "en",
        "LANG2": "ru",
        "SRCWORDRAT": 0.4,
        "TRGWORDRAT": 0.4,
        "SRCALPHARAT": 0.5,
        "TRGALPHARAT": 0.5,
        "DEBUG": false
      },
      "language": null
    },
    {
      "filter": "fasttext_filter",
      "parameters": {
        "FASTTEXT_MODEL_TYPE": "large",
        "LANG1": "en",
        "LANG2": "ru"
      },
      "language": null
    }
  ]
}

I was a bit confused by the mention of pigz and paste even though you're using --input=-, but now I've found the script where you're calling this, and it makes sense.

I've not been able to replicate the issue on my side. Even with the downloaded ru CCMatrix data, it seems to run just fine here. I used the integrated gzip + paste rather than going through stdin, but that's technically exactly the same.

I suspect one of the earlier steps is the cause of num_mismatch complaining. I've made col.py, which wraps the first two processes in the pipeline, even more strict about the number of columns in the input data. They'll now complain if the number of columns changes during their runtime. (Previously they only complained if the number of columns was insufficient for their work.)
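
For reference, a stricter check of that kind could look roughly like this (a minimal sketch, assuming tab-separated input; this is not the actual col.py code):

import sys

expected = None  # column count seen on the first line
for n, line in enumerate(sys.stdin, start=1):
    cols = line.rstrip("\n").split("\t")
    if expected is None:
        expected = len(cols)
    elif len(cols) != expected:
        # complain as soon as the column count changes mid-stream
        raise RuntimeError(f"line {n}: expected {expected} columns, got {len(cols)}")
    sys.stdout.write(line)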

I don't see any obvious bug in fix_quotes or remove_empty_lines.

Could you try again with current main and see whether it now crashes earlier for you?

PS: Once you go into production, I'd suggest increasing the batch size parameter quite a bit to amortise the cost of starting all the pipeline processes all the time. I'd also suggest lowering --parallel to roughly 2 * cpu_count() / filter_steps, since in theory all filters can run in parallel, and you also want some CPU left for the pigz at the beginning and the end.
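
For example, with the 11-filter config above, that guesstimate works out like this (just a sketch of the arithmetic):

import os

filter_steps = 11                  # filters in the config above
cpus = os.cpu_count() or 1         # e.g. 32 on the machine discussed below
parallel = max(1, 2 * cpus // filter_steps)
print(parallel)                    # 2 * 32 // 11 == 5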

The alternative idea I have in mind is adding a --validate flag that basically wraps every process in a script that checks its input and output.

Basically the same as col.py, but with a slightly different purpose. We should never use that in production, though.

With the latest main, it failed under my default config with a new error: task cluster log

[task 2023-10-05T22:10:57.852Z] [1/10:fasttext_filter]   File "/usr/lib/python3/dist-packages/fasttext/FastText.py", line 98, in __init__
[task 2023-10-05T22:10:57.852Z] [1/10:fasttext_filter]     self.f.loadModel(model_path)
[task 2023-10-05T22:10:57.852Z] [1/10:fasttext_filter] ValueError: large.bin has wrong file format!

Maybe it was the original problem and we just didn't see it. What's weird is that other datasets are cleaned OK without any errors: task group

Then I switched the fastText filter to the small model and it looks like it's working: task cluster log
but it outputs a lot of lines like this:

[2794/2:remove_empty_lines] grep: (standard input): binary file matches

I'd suggest increasing the batch size parameter quite a bit to amortise the cost of starting all the pipeline processes all the time

I thought that since we process one dataset per job, and if the dataset is not large, the batch size shouldn't be too large in order to utilize 32 cores.

all filters can run in parallel in theory

Interesting. I thought it ran them sequentially. Then indeed the batch size can be increased (I see that the default is 1M). I'm not sure what the overhead of starting new processes is, though.

Anyway, it cleaned the 139,937,785 lines of CCMatrix pretty quickly. Once we have proper charts for resource utilization, we'll be able to tune it better.

Huh, yeah, I wouldn't expect a filter that's way down the pipeline to cause an input error on a filter much earlier in the pipeline.

I thought that since we process one dataset per job, and if the dataset is not large, the batch size shouldn't be too large in order to utilize 32 cores.

I need to document this better, but batch_size basically controls how many lines go into a chunk, which is then cleaned with an entire processing pipeline cycle. It works a bit like GNU Parallel, which also starts and stops the wrapped process for each chunk. It's either that, or no guarantees on the order of the output. Which, now that I'm thinking about it, might not be such a bad guarantee to drop.
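
Conceptually, the chunking looks something like this (a simplified sketch, not the actual clean.py code):

def batches(lines, batch_size):
    """Group the input into fixed-size chunks of lines."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Each chunk is then run through a fresh instance of the whole filter
# pipeline, and the cleaned chunks are written out in their original order.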

Interesting. I thought it ran them sequentially.

Well, yes and no. It basically starts exactly the same setup of processes as pigz -cd | filter1 | filter2 | pigz -c > out.gz would do in bash. But that first pigz can be decompressing into the buffer for filter1 while filter1 is also processing its buffer into filter2, etc. So if all processes took the same amount of processing time, they could all keep running all the time. In practice you'll have one filter that holds up all the others, hence the 2 * cpu_count() in my guesstimate.
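
In Python terms, that's roughly a chain of processes connected by pipes, all running concurrently (a simplified sketch; filter1.py and filter2.py are placeholders for the configured filter steps):

import subprocess

commands = [
    ["pigz", "-cd", "corpus.tsv.gz"],   # decompress
    ["python", "filter1.py"],           # placeholder filter step
    ["python", "filter2.py"],           # placeholder filter step
    ["pigz", "-c"],                     # recompress
]

procs = []
prev = None
for cmd in commands:
    proc = subprocess.Popen(cmd, stdin=prev, stdout=subprocess.PIPE)
    if prev is not None:
        prev.close()  # only the downstream process keeps this end open
    prev = proc.stdout
    procs.append(proc)

with open("out.gz", "wb") as out:
    for chunk in iter(lambda: prev.read(65536), b""):
        out.write(chunk)  # data flows while all processes run in parallel

for proc in procs:
    proc.wait()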

[2794/2:remove_empty_lines] grep: (standard input): binary file matches

That's not good. I'll add --text to that grep command, but… why would grep think the data is binary? Random trash? Or does your grep not know about Unicode? (Not that it should really matter, it's just looking for newlines, but still…)

It's still unclear why fastText fails with ValueError: large.bin has wrong file format!. We use the lid.176.bin model in our legacy cleaning script, which corresponds to large, and maybe we should keep it this way. I don't know how big the difference between the large and small models is.
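
For reference, lid.176.bin is the large pretrained language-identification model from fasttext.cc (lid.176.ftz is the compressed small variant); loading and using it looks roughly like this (a sketch, not the filter's actual code):

import fasttext

# Assumes lid.176.bin has been downloaded from fasttext.cc; a corrupt or
# truncated download is one common cause of the "wrong file format!" ValueError.
model = fasttext.load_model("lid.176.bin")

labels, probs = model.predict("Привет, мир!")
print(labels[0], probs[0])  # e.g. __label__ru with its confidence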

Did you use the large model for CCMatrix on ru-en in your test?

I'll open a new issue, since the original issue with num_mismatch was fixed.