huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Decouple lang score and lang filter

jordane95 opened this issue

Sometimes we may want to filter different languages out of the same set. In that case, we should not have to re-score the data with the fastText model each time.
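To make the request concrete, here is a minimal sketch of scoring once and filtering many times. It is plain Python, not datatrove's API: `Doc`, `score_language`, `filter_by_language` and `toy_identify` are hypothetical names, and the `language` / `language_score` metadata keys and the 0.65 threshold are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline document."""
    text: str
    metadata: dict = field(default_factory=dict)


def score_language(doc, identify):
    """Run the (expensive) language identifier once and cache its result."""
    lang, score = identify(doc.text)
    doc.metadata["language"] = lang          # assumed metadata key
    doc.metadata["language_score"] = score   # assumed metadata key
    return doc


def filter_by_language(docs, languages, threshold=0.65):
    """Keep documents using only the cached metadata; no model call here."""
    keep = set(languages)
    return [
        d for d in docs
        if d.metadata.get("language") in keep
        and d.metadata.get("language_score", 0.0) >= threshold
    ]


def toy_identify(text):
    """Toy replacement for a real language-identification model."""
    return ("fr", 0.9) if "bonjour" in text.lower() else ("en", 0.8)


# Score every document exactly once...
docs = [
    score_language(Doc(t), toy_identify)
    for t in ["Hello world", "Bonjour le monde", "Good morning"]
]

# ...then filter as many times as needed without re-scoring.
english = filter_by_language(docs, ["en"])
french = filter_by_language(docs, ["fr"])
```

Because the filter only reads metadata, changing the set of languages costs a pass over the documents rather than another run of the model.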

We already support passing a list of languages to keep in the languages parameter.

Maybe we want to further split the resulting data by language? In that case, wouldn't all the languages be mixed together?

If you mean when you save the data to disk, you can do this by adding the language tag to output_filename. See the example in the readme: https://github.com/huggingface/datatrove?tab=readme-ov-file#saving-data
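As a rough illustration of the idea, the sketch below shows how a filename template containing a language placeholder routes documents into per-language files. It uses only Python's standard `string.Template` and does not call datatrove. The helpers `output_path` and `group_by_output`, the `${language}/${rank}.jsonl.gz` template, and the zero-padded rank are assumptions for this example; check the readme for the writer's actual behaviour.

```python
from collections import defaultdict
from string import Template


def output_path(template, rank, metadata):
    """Fill a filename template from the worker rank and document metadata."""
    # Zero-padding the rank to 5 digits is an assumption for this sketch.
    return Template(template).substitute({"rank": str(rank).zfill(5), **metadata})


def group_by_output(docs, template, rank=0):
    """Group document texts by the file each one would be written to."""
    groups = defaultdict(list)
    for text, metadata in docs:
        groups[output_path(template, rank, metadata)].append(text)
    return dict(groups)


docs = [
    ("Hello world", {"language": "en"}),
    ("Bonjour le monde", {"language": "fr"}),
    ("Good morning", {"language": "en"}),
]

# Each language ends up under its own sub-folder.
groups = group_by_output(docs, "${language}/${rank}.jsonl.gz")
```

With the language tag in the path, documents kept by a multi-language filter are no longer mixed together on disk.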