huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.


Assign more cpu to single task to speed it up for local executor?

uselesscreature86 opened this issue · comments

commented

I am using the local executor. My machine has 48 CPUs and 348 GB of RAM. Any idea how to speed this up? Currently a single task (tasks=1, running on one warc.gz file of ~1 GB) takes half an hour. This is my executor code, borrowed from the FineWeb example. I also have 200 warc.gz files to process. Is setting tasks=200 the correct way?

main_processing_executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            f"tur_subsubset",
            compression="gzip",
            glob_pattern="*.warc.gz",  
        ),
        URLFilter(),
        Trafilatura(favour_precision=True, timeout=10),
        LanguageFilter(languages=[Languages.turkish]),  # (Languages.turkish) is not a tuple; use a list or add a trailing comma
        GopherRepetitionFilter(),
        GopherQualityFilter(),
        C4QualityFilter(filter_no_terminal_punct=False),
        FineWebQualityFilter(),
        JsonlWriter(f"{FILTERING_OUTPUT_PATH}/output/{DUMP_TO_PROCESS}"),
    ],
    tasks=200,
    workers=44,
    logging_dir=f"{MAIN_OUTPUT_PATH}/logs/base_processing/{DUMP_TO_PROCESS}",
)

This should be the most optimized setup, yes. You can optionally increase workers a bit (depending on what the remaining 4 CPUs are busy doing).
Do note that we do not recommend using the default (English) values for GopherQualityFilter and FineWebQualityFilter if you are processing Turkish data. You should probably tune/adapt those blocks' options to your language.
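To see why tasks=200 with workers=44 is a sensible setup, a quick back-of-the-envelope calculation helps: with more tasks than workers, tasks run in sequential "waves" of at most `workers` parallel processes. This is plain arithmetic, not datatrove API, and the ~30 min per-file figure is taken from the question above.

```python
import math

tasks = 200    # one task per .warc.gz file
workers = 44   # parallel processes on the 48-CPU machine

waves = math.ceil(tasks / workers)
print(f"{tasks} tasks on {workers} workers -> {waves} sequential waves")
# With ~30 min per file and similarly sized tasks, total wall time is
# roughly waves * 30 min, i.e. about 2.5 hours for this setup.
```

Assigning more CPUs to a single task would not help here, since each task is a single process; throughput comes from running many tasks in parallel.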

commented

Thank you!!!!!

commented

Also, how should I set the tasks parameter? For instance, if I have 10000 files, should I set tasks=10000?

You can, yes. If tasks > the number of files, the excess tasks will not perform any work, as we do not currently split files.
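The behavior described above can be sketched with a simple sharding function. Note that `shard_files` is illustrative, not the actual datatrove API; it mimics the common pattern where task `rank` receives every `world_size`-th file, so files are never split but are distributed across tasks.

```python
def shard_files(files: list[str], rank: int, world_size: int) -> list[str]:
    """Give task `rank` its slice of the input files (round-robin sharding)."""
    return files[rank::world_size]

files = [f"file_{i}.warc.gz" for i in range(10)]

# tasks=4 < 10 files: every file is still processed, each task gets 2-3 files
for rank in range(4):
    print(rank, shard_files(files, rank, world_size=4))

# tasks=12 > 10 files: tasks 10 and 11 receive an empty list and do no work
print(shard_files(files, 11, world_size=12))  # -> []
```

Under this scheme, fewer tasks than files means each task simply processes several files in sequence, so nothing is skipped; it is just less parallel.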

commented

What about the case where tasks < the number of files? Will all the files still be processed? Will execution be faster?