THUDM / GLM

GLM (General Language Model)

self.num_samples = 1000 * self.ds_len

superhg opened this issue

In the dataset class here, why is self.num_samples = 1000 * self.ds_len multiplied by 1000?

class BlockDataset(data.Dataset):
    def __init__(self, ds, tokenizer, max_seq_len=1024, sample_across_doc=True,
                 non_sentence_start=0.0, filter_english=False, **kwargs):
        """
        sentence_start: the stripped article must start with a complete sentence
        """
        self.ds = ds
        self.ds_len = len(self.ds)
        self.num_samples = 1000 * self.ds_len
        self.max_seq_len = max_seq_le
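
The snippet above is cut off before `__len__` and `__getitem__`, so it does not show how `num_samples` is actually used. One possible reading, which the quoted code does not confirm, is that the dataset reports a "virtual" length larger than the number of documents and draws a random block on each access, so the factor 1000 would only set how many samples count as one pass. Below is a minimal sketch of that pattern in plain Python. The class name `VirtualLengthDataset`, the `multiplier` and `window` parameters, and the windowing logic are all hypothetical and are not taken from the GLM code.

```python
import random


class VirtualLengthDataset:
    """Hypothetical stand-in: reports more samples than it has documents."""

    def __init__(self, docs, multiplier=1000, window=8):
        self.docs = docs
        self.ds_len = len(docs)
        # Same pattern as the quoted snippet: the reported length is a
        # multiple of the real number of documents.
        self.num_samples = multiplier * self.ds_len
        self.window = window

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_samples:
            raise IndexError(idx)
        # Seed from the index so the same idx always yields the same sample.
        rng = random.Random(idx)
        # Assumption: idx does not map to a fixed document; one is drawn at random.
        doc = self.docs[rng.randrange(self.ds_len)]
        # Assumption: a sample is a random fixed-size window of that document.
        start = rng.randrange(max(1, len(doc) - self.window + 1))
        return doc[start:start + self.window]


docs = [list(range(100)), list(range(200, 260))]
dataset = VirtualLengthDataset(docs)
```

Under these assumptions, `len(dataset)` is 2000 even though there are only two documents, and each index picks a different random window. If the real class works this way, the multiplier would not change which data is available, only how many draws a sampler makes per epoch.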