THUDM / GLM

GLM (General Language Model)

The pretraining corpus of GLM-Large-Chinese

cklsoft opened this issue

Hi,

  1. Which pretraining corpus was used for the released GLM-Large-Chinese/GLM-10B-Chinese models? Is it Wiki+BookCorpus, as stated in the README, or wudao, baike, and zhihu (as listed in config/ds_block_large_chinese.sh)?
  2. Also, how large is the corpus used to train GLM-Large-Chinese and GLM-10B-Chinese?

Thanks.

Sorry for the mistake in the README.
Both Chinese models are pre-trained on WuDaoCorpus (1.1 TB), Baidu Baike (87 GB), and Zhihu (131 GB), roughly 1.3 TB in total.
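
For context, the dataset names quoted from config/ds_block_large_chinese.sh in question 1 suggest the corpora are selected by name in the training script. A minimal sketch of how such a dataset list might look, assuming a Megatron-style --train-data option (the flag name and surrounding layout are assumptions, not confirmed in this thread):

```bash
#!/usr/bin/env bash
# Hypothetical excerpt from config/ds_block_large_chinese.sh: the dataset
# names (wudao, baike, zhihu) come from the question above; the
# --train-data flag and the option layout are assumptions.
gpt_options=" \
       --block-lm \
       --train-data wudao baike zhihu \
"
echo "${gpt_options}"
```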