huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.


Assign more cpu to single task to speed it up for local executor?

uselesscreature86 opened this issue · comments

commented

I am using the local executor. My machine has 48 CPUs and 348 GB of RAM. Any idea how to speed this up? Currently a single task (tasks=1, running on one warc.gz file of ~1 GB) takes half an hour. This is my executor code, borrowed from the FineWeb example. I also have 200 warc.gz files to process. Is setting tasks=200 the correct way?

main_processing_executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            f"tur_subsubset",
            compression="gzip",
            glob_pattern="*.warc.gz",  
        ),
        URLFilter(),
        Trafilatura(favour_precision=True, timeout=10),
        LanguageFilter(languages=[Languages.turkish]),  # (Languages.turkish) is not a tuple; use a list or add a trailing comma
        GopherRepetitionFilter(),
        GopherQualityFilter(),
        C4QualityFilter(filter_no_terminal_punct=False),
        FineWebQualityFilter(),
        JsonlWriter(f"{FILTERING_OUTPUT_PATH}/output/{DUMP_TO_PROCESS}"),
    ],
    tasks=200,
    workers=44,
    logging_dir=f"{MAIN_OUTPUT_PATH}/logs/base_processing/{DUMP_TO_PROCESS}",
)

This should be the most optimized setup, yes. You can optionally increase workers a bit (depending on what the remaining 4 CPUs are busy doing).
Do note that we do not recommend using the default (English) values for GopherQualityFilter and FineWebQualityFilter if you are processing Turkish data. You should probably tune/adapt those blocks' options to your language.
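To see why tasks=200 with workers=44 is a sensible setup, a quick back-of-the-envelope calculation helps: with more tasks than workers, tasks run in sequential "waves" of at most `workers` parallel processes. This is plain arithmetic, not datatrove API, and the ~30 min per-file figure is taken from the question above.

```python
import math

tasks = 200    # one task per .warc.gz file
workers = 44   # parallel processes on the 48-CPU machine

waves = math.ceil(tasks / workers)
print(f"{tasks} tasks on {workers} workers -> {waves} sequential waves")
# With ~30 min per file and similarly sized tasks, total wall time is
# roughly waves * 30 min, i.e. about 2.5 hours for this setup.
```

Assigning more CPUs to a single task would not help here, since each task is a single process; throughput comes from running many tasks in parallel.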

commented

Thank you!!!!!

commented

Also, how should I set the tasks parameter? For instance, if I have 10000 files, should I set tasks=10000?

You can, yes. If tasks > the number of files, the excess tasks will not perform any work, as we do not currently split files.
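The behavior described above can be sketched with a simple sharding function. Note that `shard_files` is illustrative, not the actual datatrove API; it mimics the common pattern where task `rank` receives every `world_size`-th file, so files are never split but are distributed across tasks.

```python
def shard_files(files: list[str], rank: int, world_size: int) -> list[str]:
    """Give task `rank` its slice of the input files (round-robin sharding)."""
    return files[rank::world_size]

files = [f"file_{i}.warc.gz" for i in range(10)]

# tasks=4 < 10 files: every file is still processed, each task gets 2-3 files
for rank in range(4):
    print(rank, shard_files(files, rank, world_size=4))

# tasks=12 > 10 files: tasks 10 and 11 receive an empty list and do no work
print(shard_files(files, 11, world_size=12))  # -> []
```

Under this scheme, fewer tasks than files means each task simply processes several files in sequence, so nothing is skipped; it is just less parallel.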

commented

What about the case where tasks < the number of files? Will all the files still be processed? Will execution be faster?