Decouple lang score and lang filter
jordane95 opened this issue · comments
jordane95 commented
Sometimes we may want to filter different languages out of the same dataset. In that case, we should not have to re-score the data with the fastText model each time.
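To make the request concrete, here is a minimal sketch of the idea: score once, then filter as many times as needed. It is not datatrove code. The helpers `score_once`, `filter_by_language`, and `toy_detect` are hypothetical, and so are the metadata keys `language` and `language_score` and the default threshold.

```python
from typing import Callable, Iterable


def score_once(texts: Iterable[str], detect: Callable[[str], tuple[str, float]]) -> list[dict]:
    """Run the (expensive) language detector a single time per document."""
    scored = []
    for text in texts:
        language, score = detect(text)
        # Store the prediction so later filters can reuse it without re-scoring.
        scored.append({"text": text, "metadata": {"language": language, "language_score": score}})
    return scored


def filter_by_language(docs: list[dict], languages: Iterable[str], threshold: float = 0.65) -> list[dict]:
    """Cheap filter that only reads the stored metadata."""
    keep = set(languages)
    return [
        doc
        for doc in docs
        if doc["metadata"]["language"] in keep and doc["metadata"]["language_score"] >= threshold
    ]


def toy_detect(text: str) -> tuple[str, float]:
    # Hypothetical stand-in for a real language identification model.
    return ("fr", 0.9) if "bonjour" in text.lower() else ("en", 0.8)


docs = score_once(["Hello world", "Bonjour le monde"], toy_detect)
english = filter_by_language(docs, ["en"])  # first filter pass
french = filter_by_language(docs, ["fr"])   # second pass, no re-scoring
```

The detector runs only inside `score_once`, so each additional language subset costs one pass over the stored metadata.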
Guilherme Penedo commented
We already support passing a list of languages to keep in the `languages` parameter.
jordane95 commented
Maybe we want to further split the resulting data by language? In that case, wouldn't all the languages be mixed together?
Guilherme Penedo commented
If you mean when you save the data to disk, you can do this by adding the language tag to `output_filename`. See the example in the readme: https://github.com/huggingface/datatrove?tab=readme-ov-file#saving-data
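As a rough illustration of how a language tag in `output_filename` can split the output, the sketch below expands a `${language}` placeholder from each document's metadata using only the standard library. It is not datatrove's writer: `group_by_output_file` is a hypothetical helper, and the `${language}`/`${rank}` placeholder syntax and the `language` metadata key are assumptions here. The linked readme example is the authoritative reference.

```python
from collections import defaultdict
from string import Template


def group_by_output_file(docs: list[dict], output_filename: str, rank: int = 0) -> dict[str, list[str]]:
    """Map each document to the file path produced by expanding the template."""
    template = Template(output_filename)
    groups: dict[str, list[str]] = defaultdict(list)
    for doc in docs:
        # The language tag comes from per-document metadata, so documents
        # in different languages resolve to different paths.
        path = template.substitute(language=doc["metadata"]["language"], rank=f"{rank:05d}")
        groups[path].append(doc["text"])
    return dict(groups)


docs = [
    {"text": "Hello world", "metadata": {"language": "en"}},
    {"text": "Bonjour le monde", "metadata": {"language": "fr"}},
    {"text": "Good morning", "metadata": {"language": "en"}},
]
files = group_by_output_file(docs, "${language}/${rank}.jsonl")
```

With this template, each language ends up under its own folder, so a single filter that keeps several languages does not leave them mixed together on disk.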