Assign more CPUs to a single task to speed it up with the local executor?
uselesscreature86 opened this issue · comments
I am using the local executor. My machine has 48 CPUs and 348 GB of RAM. Any idea how to speed this up? Currently a single task (tasks=1, running on one warc.gz file of ~1 GB) takes half an hour. This is my executor code, borrowed from the FineWeb example. Also, I have 200 warc.gz files to process. Is setting tasks=200 the correct way?
main_processing_executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            "tur_subsubset",
            compression="gzip",
            glob_pattern="*.warc.gz",
        ),
        URLFilter(),
        Trafilatura(favour_precision=True, timeout=10),
        # note: (Languages.turkish) without a trailing comma is not a tuple,
        # so pass a list (or (Languages.turkish,)) instead
        LanguageFilter(languages=[Languages.turkish]),
        GopherRepetitionFilter(),
        GopherQualityFilter(),
        C4QualityFilter(filter_no_terminal_punct=False),
        FineWebQualityFilter(),
        JsonlWriter(f"{FILTERING_OUTPUT_PATH}/output/{DUMP_TO_PROCESS}"),
    ],
    tasks=200,
    workers=44,
    logging_dir=f"{MAIN_OUTPUT_PATH}/logs/base_processing/{DUMP_TO_PROCESS}",
)
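To see why tasks=200 lines up well with 200 files, here is a plain-Python sketch (no datatrove required) of the round-robin sharding idea the readers use: each task with a given rank processes the slice `files[rank::tasks]`. The function name `files_for_task` is my own illustration, not a datatrove API.

```python
def files_for_task(files, rank, tasks):
    """Round-robin shard of the file list assigned to one task (illustrative)."""
    return files[rank::tasks]

# hypothetical file names standing in for the 200 warc.gz inputs
files = [f"crawl-{i:03d}.warc.gz" for i in range(200)]

# With tasks == number of files, every task gets exactly one file;
# workers=44 then caps how many of those tasks run at the same time.
assert all(len(files_for_task(files, r, 200)) == 1 for r in range(200))
```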
This should be the most optimized setup, yes. You can optionally increase `workers` a bit (depending on what the remaining 4 CPUs are busy doing).
Do note that we do not recommend using the default (English) values for GopherQualityFilter and FineWebQualityFilter if you are processing Turkish data. You should probably tune/adapt the options of those blocks to your language.
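As a rough sketch of what that tuning could look like, the snippet below swaps the English stop words for Turkish ones in GopherQualityFilter. The parameter names (`stop_words`, `min_stop_words`) match recent datatrove versions, but check the class signature for your install; the Turkish word list is a tiny illustrative placeholder, not a vetted list.

```python
from datatrove.pipeline.filters import GopherQualityFilter

# placeholder list for illustration only -- use a proper Turkish stop-word list
TURKISH_STOP_WORDS = ["ve", "bir", "bu", "da", "de", "için"]

gopher_quality = GopherQualityFilter(
    stop_words=TURKISH_STOP_WORDS,
    min_stop_words=2,  # threshold worth tuning on a sample of your data
)
```

The same idea applies to the other thresholds (word counts, symbol ratios, etc.): inspect a sample of filtered-out Turkish documents and adjust until the filters stop discarding good text.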
Thank you!!!!!
Also, how to set the parameter tasks? For instance, if I have 10000 files, should I set tasks = 10000?
You can, yes. If tasks > number of files, then the excess tasks will not perform any work, as we do not currently split files.
How about the case where tasks < nb of files? Will all the files be processed? Will the execution speed be faster?
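Extending the earlier round-robin sketch to the tasks < number-of-files case: with the `files[rank::tasks]` slicing, every file is still assigned to exactly one task, but each task now processes several files in sequence, so wall-clock time grows roughly with files-per-task (again, `files_for_task` is my own illustration, not a datatrove API).

```python
def files_for_task(files, rank, tasks):
    """Round-robin shard of the file list assigned to one task (illustrative)."""
    return files[rank::tasks]

# hypothetical names standing in for 10000 input files, split over 200 tasks
files = [f"crawl-{i:04d}.warc.gz" for i in range(10000)]
tasks = 200

shards = [files_for_task(files, r, tasks) for r in range(tasks)]

# All 10000 files are covered exactly once, 50 files per task:
assert sorted(f for shard in shards for f in shard) == sorted(files)
assert all(len(shard) == 50 for shard in shards)
```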