error occurs in `compute_domain_idxs`

Question

BeachWang opened this issue 7 months ago

Hi,

I tried to run the pipeline in `experimental`. However, an error occurred in `compute_domain_idxs` when I ran the resampling step. It seems that `ds_path` is used as a `Path` or string, but it is actually a list.
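For readers hitting the same symptom, here is a minimal sketch of the mismatch described above and one common way to guard against it. The helper name `as_path_list` is hypothetical and is not taken from the repository or from the fix commit. It only illustrates accepting either a single path or a list of paths.

```python
from pathlib import Path
from typing import List, Sequence, Union

PathLike = Union[str, Path]


def as_path_list(ds_path: Union[PathLike, Sequence[PathLike]]) -> List[Path]:
    """Hypothetical helper: accept one path or a list of paths, return a list of Path objects."""
    if isinstance(ds_path, (str, Path)):
        # A single path is wrapped so callers can always iterate.
        return [Path(ds_path)]
    return [Path(p) for p in ds_path]


# Passing a list straight to Path() raises TypeError, which matches the reported symptom.
try:
    Path(["a.jsonl", "b.jsonl"])
except TypeError as exc:
    print(f"TypeError: {exc}")

# Both call styles now yield a list of Path objects.
print(as_path_list("a.jsonl"))
print(as_path_list(["a.jsonl", Path("b.jsonl")]))
```

Whether the actual code should normalize the argument like this or loop over the list at the call site depends on how `compute_domain_idxs` is used, so treat this as an illustration only.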

Sang Michael Xie · Answer 1 · Sat Dec 02 2023 08:10:01 GMT+0800 (China Standard Time)

Thanks, just made a commit (a085821) that should fix it. Let me know if there are more issues.

Btw, I think this pipeline could be run much faster using the library code in the outer directory: after preprocessing into chunks, you can use `HashedNgramDSIR` on that preprocessed data with the `tokenizer=word_tokenize` argument. The quality filter can be implemented by overriding `get_perexample_metadata` and `perexample_metadata_filter` (also, in my experience, the default length filter that's already implemented is the main one that has an effect).
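To illustrate the override pattern mentioned above, here is a rough, self-contained sketch. `SelectorSketch` is a stand-in written for this example and is not the library's `HashedNgramDSIR`. Only the two hook names come from the answer; their signatures, the `select` driver, and the "alphabetic fraction" quality rule are assumptions, so check the library source for the real interfaces before adapting this.

```python
from typing import Dict, List, Tuple

Example = Dict[str, str]
Metadata = Tuple[float, ...]


class SelectorSketch:
    """Stand-in base class, NOT the library's HashedNgramDSIR; it only mirrors the two hook names."""

    def __init__(self, min_example_length: int = 5) -> None:
        self.min_example_length = min_example_length

    def get_perexample_metadata(self, ex: Example) -> Metadata:
        # Assumed signature: takes one example, returns its metadata (here: word count).
        return (float(len(ex["text"].split())),)

    def perexample_metadata_filter(self, metadata: List[Metadata]) -> List[bool]:
        # Assumed signature: takes all metadata rows, returns a keep/drop mask.
        return [row[0] >= self.min_example_length for row in metadata]

    def select(self, examples: List[Example]) -> List[Example]:
        # Hypothetical driver: compute metadata, then keep examples that pass the filter.
        metadata = [self.get_perexample_metadata(ex) for ex in examples]
        mask = self.perexample_metadata_filter(metadata)
        return [ex for ex, keep in zip(examples, mask) if keep]


class QualityFilterSketch(SelectorSketch):
    """Overrides both hooks to add a made-up quality rule on top of the length rule."""

    def get_perexample_metadata(self, ex: Example) -> Metadata:
        words = ex["text"].split()
        # Fraction of purely alphabetic words; a hypothetical quality signal.
        alpha_frac = sum(w.isalpha() for w in words) / max(len(words), 1)
        return (float(len(words)), alpha_frac)

    def perexample_metadata_filter(self, metadata: List[Metadata]) -> List[bool]:
        return [
            length >= self.min_example_length and alpha_frac >= 0.8
            for length, alpha_frac in metadata
        ]


examples = [
    {"text": "a short clean sentence with enough words"},
    {"text": "too short"},
    {"text": "1234 #### 5678 @@@@ 9999 !!!! 0000"},
]
kept = QualityFilterSketch(min_example_length=5).select(examples)
print([ex["text"] for ex in kept])
```

The point of the two-hook split is that metadata is computed once per example, while the keep-or-drop decision is made over all the metadata at once, so the filter thresholds can be changed without recomputing per-example values.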