Which enwiki-latest-pages-articles dataset do the TinyBERT experiments actually use?


ra225 opened this issue a year ago

Page 6 of the paper states:
For the general distillation, we set the maximum sequence length to 128 and use English Wikipedia (2,500M words)
I downloaded "the latest dump" from the link given at https://github.com/google-research/bert.
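The archive decompresses to a single 86 GB XML file. This project's preprocessing code kept failing with out-of-disk-space errors and died after every ten-plus hours of running. After reading the code, I changed the path of self.document_shelf_filepath on line 52 of pregenerate_training_data.py from the /cache/ directory to a directory on an external disk with 500 GB of space. That finally stopped the disk-space errors, but processing is very slow: it took 84 hours just to get from line 367 to line 390 of the code.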
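For anyone hitting the same disk-space error, the change amounts to roughly the following. This is only a minimal sketch, assuming the script keeps its documents in a Python shelve file, as the attribute name document_shelf_filepath suggests. The helper names open_document_shelf and add_document are hypothetical and are not taken from the repository.

```python
import shelve
from pathlib import Path


def open_document_shelf(working_dir):
    """Open a disk-backed shelf under working_dir (hypothetical helper).

    Pointing working_dir at a large disk keeps the shelf off a small
    default location such as /cache/.
    """
    working_dir = Path(working_dir)
    working_dir.mkdir(parents=True, exist_ok=True)
    shelf_path = working_dir / "shelf.db"
    # flag="n" always creates a new, empty database at that path.
    return shelve.open(str(shelf_path), flag="n")


def add_document(shelf, index, sentences):
    """Store one tokenized document under a string key (shelve keys must be str)."""
    shelf[str(index)] = sentences
```

In my case working_dir would be a directory on the external 500 GB disk. The shelf should be closed with shelf.close() once preprocessing is done.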
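Then came the worst part. The script still has to run 3 epochs after that, and it took another 2 days to finish just 5% of the first epoch. That works out to 40 days per epoch, or 120 days for all 3 epochs!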
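The estimate above is a simple linear extrapolation. A small sketch of the arithmetic, assuming throughput stays constant over the whole run (the function name estimate_days is made up for illustration):

```python
def estimate_days(elapsed_days, fraction_done, epochs=1):
    """Linearly extrapolate the run time from the progress so far.

    Returns (days per epoch, days for all epochs), assuming the
    processing rate does not change.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    per_epoch = elapsed_days / fraction_done
    return per_epoch, per_epoch * epochs


# Numbers from this issue: 2 days for 5% of one epoch, 3 epochs in total.
per_epoch, total = estimate_days(2, 0.05, epochs=3)
print(f"{per_epoch:.0f} days per epoch, {total:.0f} days in total")
```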
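Is data preprocessing alone really supposed to take this long? And even once it finishes, training on GPUs comes next; will that take even longer?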
Which dataset did the paper actually use? Do I need to run this on the Huawei Cloud platform for it to be faster?