huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Local fasttext model

jordane95 opened this issue

It seems that the current fastText filter can only load a model from a remote URL. Would it be possible to support loading a model from a local path?

It should also work with a local path. In that case, I believe it copies the model to the HF cache folder.

  File "/output/datatrove/src/datatrove/pipeline/filters/fasttext_filter.py", line 67, in filter
    labels, scores = self.model.predict(doc.text.replace("\n", ""))
  File "/output/datatrove/src/datatrove/pipeline/filters/fasttext_filter.py", line 63, in model
    self._model = _FastText(model_file)
  File "/opt/conda/envs/datatrove/lib/python3.10/site-packages/fasttext/FastText.py", line 98, in __init__
    self.f.loadModel(model_path)
ValueError: /root/.cache/huggingface/assets/datatrove/filters/fasttext/_data_math_filter_train_cls_models_fasttext_math.bin has wrong file format!

I guess this error may be related to the distributed setting, i.e., multiple workers writing to the same cache path at the same time, so that one worker loads a file another worker is still copying?

We have fixed asset loading/downloading by adding file locks in #155
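
For readers who hit the same race, here is a rough sketch of how a file lock can guard a copy into a shared cache folder. It is not the code from #155; the function name `copy_to_cache_locked` and the `.lock` suffix are made up for illustration, and it assumes a POSIX system where `fcntl.flock` is available.

```python
import fcntl
import os
import shutil
import tempfile


def copy_to_cache_locked(src_path: str, cache_dir: str) -> str:
    """Copy src_path into cache_dir so that concurrent workers never see a partial file.

    Hypothetical helper for illustration only (POSIX-only because of fcntl).
    """
    os.makedirs(cache_dir, exist_ok=True)
    dest_path = os.path.join(cache_dir, os.path.basename(src_path))
    lock_path = dest_path + ".lock"  # assumed naming convention for the lock file

    with open(lock_path, "w") as lock_file:
        # Block until this worker holds the exclusive lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Another worker may have finished the copy while we were waiting.
            if not os.path.exists(dest_path):
                # Copy to a temporary file first, then rename atomically,
                # so dest_path only ever exists as a complete file.
                fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
                os.close(fd)
                shutil.copyfile(src_path, tmp_path)
                os.replace(tmp_path, dest_path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return dest_path
```

Each worker would call the helper before loading the model. Only the first one performs the copy, and the others wait on the lock and then reuse the finished file.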