error occurs in `compute_domain_idxs`
BeachWang opened this issue
Thanks, just made a commit (a085821) that should fix it. Let me know if there are more issues.
Btw, I think this pipeline could run much faster using the library code in the outer directory: after preprocessing into chunks, you can run `HashedNgramDSIR` on the preprocessed data with the `tokenizer=word_tokenize` argument. The quality filter can be implemented by overriding `get_perexample_metadata` and `perexample_metadata_filter`
(also in my experience, the default length filter that's already implemented is the main one that has an effect).
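For concreteness, here is a rough sketch of what those two overrides might look like. The hook names come from the comment above, but their exact signatures, the thresholds, and the subclassing pattern are assumptions on my part (the word count uses a plain `split()` rather than `word_tokenize` just to keep the sketch dependency-free), so treat this as illustrative rather than verified against the library:

```python
# Sketch of a per-example quality filter, assuming the hooks take an example
# dict and produce/consume a single metadata value. Thresholds are illustrative.

def get_perexample_metadata(example: dict) -> int:
    # Compute the statistic the filter needs: here a simple word count,
    # since the length filter is the one that matters most in practice.
    return len(example['text'].split())

def perexample_metadata_filter(metadata: int) -> bool:
    # Keep examples whose word count falls in a plausible range.
    return 40 <= metadata <= 100_000

# In the library these would presumably be overrides on a subclass, e.g.:
#
#   class QualityFilteredDSIR(HashedNgramDSIR):
#       def get_perexample_metadata(self, example): ...
#       def perexample_metadata_filter(self, metadata): ...

example = {'text': 'word ' * 50}
print(perexample_metadata_filter(get_perexample_metadata(example)))
```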