AILab-CVC / SEED

Official implementation of SEED-LLaMA (ICLR 2024).

Home Page: https://ailab-cvc.github.io/seed

Training Data of Tokenizer

zheedong opened this issue

Thanks for your great work.

In the paper, you state that the tokenizer's training data is 'CC3M, Unsplash, LAION-COCO, MS-COCO'. Did you use all four datasets in full, or did you apply some filtering? What is the total amount of training data used for tokenizer training?

Also, did you use the same training data in stage 1 and stage 2 of tokenizer training?

Yes, we use all of CC3M, Unsplash, LAION-COCO, and MS-COCO for training the tokenizer in both stage 1 and stage 2. The total amount of training data is almost 500M.
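For readers looking for a starting point while the official configs are unavailable, below is a minimal sketch of how the four image-text sources named above could be combined into a single training mixture. This is not the repository's actual config; the `ImageTextSource` class, all paths, and the loader settings are hypothetical placeholders.

```python
# Illustrative sketch only, not the authors' actual data pipeline.
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class ImageTextSource(Dataset):
    """Hypothetical wrapper around one pre-downloaded image-text source."""

    def __init__(self, root: str):
        self.root = root
        self.samples = []  # would be populated by scanning `root` for (image, caption) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]  # (image_tensor, caption)


# The same mixture is reused for both tokenizer training stages, per the answer above;
# together the sources amount to roughly 500M samples.
sources = [
    ImageTextSource("data/cc3m"),
    ImageTextSource("data/unsplash"),
    ImageTextSource("data/laion_coco"),
    ImageTextSource("data/mscoco"),
]
train_set = ConcatDataset(sources)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)
```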

I looked through your code, but I cannot find the configs for the training datasets. Could you share more details about them? Also, how many epochs did you train for?