huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Migrate word tokenizer download functions to process locked download

hynky1999 opened this issue

Problem

#187 introduced new tokenizer libraries, which often need to download several files before they can work. This can cause problems because the downloads are not interlocked across processes.

Solution

Inspect the libraries and try to pre-download the files using a process-locked download.
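
As a rough illustration of what a process-locked download could look like, here is a minimal sketch using only the Python standard library. It is not datatrove's actual helper: `locked_download`, the `fetch` callback, and the `.lock` / `.done` file naming are all hypothetical, and `fcntl` is only available on Unix-like systems.

```python
import fcntl
from pathlib import Path
from typing import Callable


def locked_download(target: str, fetch: Callable[[str], None]) -> str:
    """Run fetch(target) at most once across processes (hypothetical helper).

    Assumption: every process agrees on the same `target` path, so they
    all contend for the same lock file next to it.
    """
    target_path = Path(target)
    target_path.parent.mkdir(parents=True, exist_ok=True)
    done_marker = Path(f"{target}.done")  # written only after fetch succeeds
    lock_path = f"{target}.lock"

    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the exclusive lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Re-check under the lock: another process may have finished
            # the download while this one was waiting.
            if not done_marker.exists():
                fetch(target)
                done_marker.touch()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target
```

Each tokenizer would then wrap whatever download call its library exposes in a `fetch` callback, so that the first process performs the download and the others wait and reuse the result.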