p-lambda / dsir

DSIR large-scale data selection framework for language model training

Home Page:https://arxiv.org/abs/2302.03169

error occurs in `compute_domain_idxs`

BeachWang opened this issue

Hi,

I tried to run the pipeline in the experimental directory. However, an error occurred in compute_domain_idxs when I ran the resample step. It seems that ds_path is used as a Path or string, but it is actually a list.

[Screenshot, 2023-11-30 12:38:36 PM]
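The mismatch described above (a function expecting a single Path or string but receiving a list) can be guarded against with a small normalization step. The following is only a minimal sketch of that idea: `normalize_ds_paths` is a hypothetical helper, not code from the repository, and it is not the change made in the fix commit.

```python
from pathlib import Path
from typing import List, Sequence, Union

PathLike = Union[str, Path]


def normalize_ds_paths(ds_path: Union[PathLike, Sequence[PathLike]]) -> List[Path]:
    """Hypothetical helper: accept one path or a list of paths, return a list of Paths."""
    # A single str or Path is wrapped in a one-element list.
    if isinstance(ds_path, (str, Path)):
        return [Path(ds_path)]
    # Anything else is assumed to be a sequence of path-like items.
    return [Path(p) for p in ds_path]


# Both call styles now produce the same kind of value.
single = normalize_ds_paths("data/chunk_0.jsonl")
several = normalize_ds_paths(["data/chunk_0.jsonl", Path("data/chunk_1.jsonl")])
```

Code that previously did something like `Path(ds_path)` could then loop over the returned list instead, so either input shape works.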

Thanks, just made a commit (a085821) that should fix it. Let me know if there are more issues.

Btw, I think this pipeline could run much faster using the library code in the outer directory: after preprocessing into chunks, you can run HashedNgramDSIR on the preprocessed data with the tokenizer=word_tokenize argument. The quality filter can be implemented by overriding get_perexample_metadata and perexample_metadata_filter. (In my experience, the default length filter that's already implemented is the main one that has an effect.)
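The override pattern mentioned above could look roughly like the sketch below. To keep it self-contained, it uses a stand-in base class rather than importing the library, so everything other than the two method names is an assumption: the method signatures, the `min_example_length` attribute, and the "alphabetic word ratio" quality signal are hypothetical and should be checked against the actual HashedNgramDSIR source before use.

```python
from typing import Any, List, Sequence, Tuple


class SelectorStandIn:
    """Hypothetical stand-in for the library's selector class (NOT the real HashedNgramDSIR)."""

    def __init__(self, min_example_length: int = 100) -> None:
        self.min_example_length = min_example_length

    def get_perexample_metadata(self, ex: str, features: Any) -> Any:
        # Assumed default: metadata is the example length in whitespace-separated words.
        return len(ex.split())

    def perexample_metadata_filter(self, concat_metadata: Sequence[Any]) -> List[bool]:
        # Assumed default: keep examples that meet the minimum length.
        return [m >= self.min_example_length for m in concat_metadata]


class QualityFilteredSelector(SelectorStandIn):
    """Adds an illustrative quality signal on top of the length filter."""

    def __init__(self, min_example_length: int = 100, min_alpha_ratio: float = 0.7) -> None:
        super().__init__(min_example_length)
        self.min_alpha_ratio = min_alpha_ratio

    def get_perexample_metadata(self, ex: str, features: Any) -> Tuple[int, float]:
        # Record both the length and the fraction of words containing a letter.
        words = ex.split()
        n_alpha = sum(1 for w in words if any(c.isalpha() for c in w))
        alpha_ratio = n_alpha / len(words) if words else 0.0
        return (len(words), alpha_ratio)

    def perexample_metadata_filter(self, concat_metadata: Sequence[Tuple[int, float]]) -> List[bool]:
        # Keep an example only if it passes both the length and the quality threshold.
        return [
            length >= self.min_example_length and ratio >= self.min_alpha_ratio
            for length, ratio in concat_metadata
        ]


selector = QualityFilteredSelector(min_example_length=3, min_alpha_ratio=0.7)
examples = ["the quick brown fox", "1 2 3 4", "hi"]
metadata = [selector.get_perexample_metadata(ex, None) for ex in examples]
mask = selector.perexample_metadata_filter(metadata)
```

With the real library, the subclass would presumably derive from HashedNgramDSIR instead of the stand-in, and the metadata would likely need to be an array-friendly type rather than a tuple.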