The pretraining corpus of GLM-Large-Chinese
cklsoft opened this issue
Hi,
- What is the pretraining corpus of the released GLM-Large-Chinese/GLM-10B-Chinese? Is it Wiki+BookCorpus (as stated in the README), or wudao, baike, and zhihu (as set in config/ds_block_large_chinese.sh)?
- Besides, how large is the corpus used to train GLM-Large-Chinese and GLM-10B-Chinese?
Thanks.
Sorry for the mistake in the README.
Both Chinese models are pre-trained on WuDaoCorpus (1.1 TB), Baidu Baike (87 GB), and Zhihu (131 GB).
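For a ballpark total, the quoted sizes add up as follows (a quick sketch, assuming decimal units, i.e. 1 TB = 1000 GB, and that the figures refer to raw text on disk):

```python
# Sum of the corpus sizes quoted above.
# Assumption: decimal units (1 TB = 1000 GB); sizes are raw text on disk.
sizes_gb = {"WuDaoCorpus": 1100, "Baidu Baike": 87, "Zhihu": 131}
total_gb = sum(sizes_gb.values())
print(f"Total: {total_gb} GB (~{total_gb / 1000:.2f} TB)")
# Total: 1318 GB (~1.32 TB)
```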