The pretraining corpus of GLM-Large-Chinese
cklsoft opened this issue
Hi,
- What is the pretraining corpus of the released GLM-Large-Chinese/GLM-10B-Chinese? Is it Wiki+BookCorpus (as stated in the README), or wudao, baike, and zhihu (as set in config/ds_block_large_chinese.sh)?
- Besides, how large is the corpus used to train GLM-Large-Chinese and GLM-10B-Chinese?
Thanks.
Sorry for the mistake in the README.
Both Chinese models are pre-trained on WuDaoCorpus (1.1 TB), Baidu Baike (87 GB), and Zhihu (131 GB).
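For a ballpark total, the quoted sizes add up as follows (a quick sketch, assuming decimal units, i.e. 1 TB = 1000 GB, and that the figures refer to raw text on disk):

```python
# Sum of the corpus sizes quoted above.
# Assumption: decimal units (1 TB = 1000 GB); sizes are raw text on disk.
sizes_gb = {"WuDaoCorpus": 1100, "Baidu Baike": 87, "Zhihu": 131}
total_gb = sum(sizes_gb.values())
print(f"Total: {total_gb} GB (~{total_gb / 1000:.2f} TB)")
# Total: 1318 GB (~1.32 TB)
```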