Why does roberta_large score much lower than roberta_middle on CMRC2018?

Question

ewrfcas opened this issue 5 years ago · comments

https://hfl-rc.github.io/cmrc2018/task/#section-1
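I wanted to see how RoBERTa performs on reading comprehension. I converted the middle and large checkpoints to PyTorch and ran them on CMRC2018: middle reaches an F1 of 86, but large only reaches 77, which is very strange.
Using the provided PyTorch version of the large weights directly gives the same result.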

yingzq · Answer 1 · Mon Sep 09 2019 15:11:32 GMT+0800 (China Standard Time)

@ewrfcas Where can I find roberta-middle? I don't see it listed on the page.

ewrfcas · Answer 2 · Mon Sep 09 2019 15:22:45 GMT+0800 (China Standard Time)

@YingZiqiang It is Roberta_l24_zh_base: 24 layers, 12 heads, hidden size 768.

brightmart · Answer 3 · Mon Sep 09 2019 21:15:32 GMT+0800 (China Standard Time)

In our tests, large performs better than middle. What training hyperparameters did you use? Could you post them, including the batch size?

ewrfcas · Answer 4 · Mon Sep 09 2019 22:34:02 GMT+0800 (China Standard Time)

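@brightmart Thanks for the reply. I trained large on 5 GPUs with a batch size of 30, and middle with a batch size of 32, for 3 epochs in total, with lr=3e-5/2e-5 and warmup=0.1. Apart from the batch size, the setup is essentially the same as for middle.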
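The comment above gives lr and warmup=0.1 but not the schedule itself. The following is a minimal sketch, assuming warmup=0.1 means a linear warmup over the first 10% of steps followed by a linear decay to zero (a common choice for BERT fine-tuning, not confirmed in this thread). The function name `lr_at_step` and the value `total_steps=1000` are hypothetical.

```python
def lr_at_step(step, total_steps, peak_lr=3e-5, warmup_proportion=0.1):
    """Learning rate at a given optimizer step.

    Assumption: linear warmup to peak_lr, then linear decay to 0.
    The exact schedule used in the thread is not stated.
    """
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length, for illustration only.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

With the same number of epochs, a different batch size changes `total_steps`, and therefore the number of warmup steps, which is one reason to match batch sizes when comparing two models.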

ewrfcas · Answer 5 · Mon Sep 09 2019 22:35:32 GMT+0800 (China Standard Time)

Also, large and middle should share the same vocabulary, right? In that case the preprocessing shouldn't be the problem.

brightmart · Answer 6 · Mon Sep 09 2019 23:14:36 GMT+0800 (China Standard Time)

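The vocabularies are exactly identical. Check the file names in the large and middle folders. Could it be that the large checkpoint failed to load? Run it again, verify that the checkpoint loaded successfully, and use the same batch size of 32.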

Yiming Cui · Answer 7 · Tue Sep 10 2019 09:04:42 GMT+0800 (China Standard Time)

Same question here.
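I tried three reading comprehension datasets, CMRC 2018, DRCD, and CJRC, and large performs poorly on all of them (this is not a case of init_ckpt failing to load). On XNLI, however, I get better results than @brightmart reported. Perhaps large was not pretrained with max_seq_len=512?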

ewrfcas · Answer 8 · Tue Sep 10 2019 10:45:53 GMT+0800 (China Standard Time)

The loading should have succeeded. I compared the parameters, and the only weights that were not loaded are the pooler weights for cls.
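A parameter comparison like the one described above can be sketched as a set difference over parameter names. This is only an illustration: the helper `report_unloaded` and the parameter names below are hypothetical, not taken from the author's code.

```python
def report_unloaded(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names.

    missing:    names the model expects but the checkpoint lacks
                (these stay at their random initialization).
    unexpected: names in the checkpoint that the model has no slot for.
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical parameter names, for illustration only.
model_keys = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "qa_outputs.weight",
    "qa_outputs.bias",
]
checkpoint_keys = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.pooler.dense.weight",
    "bert.pooler.dense.bias",
]

missing, unexpected = report_unloaded(model_keys, checkpoint_keys)
print("missing from checkpoint:", missing)
print("unused checkpoint weights:", unexpected)
```

In PyTorch, `model.load_state_dict(state_dict, strict=False)` returns an object with `missing_keys` and `unexpected_keys` that carries the same information. If only the pooler and the freshly added task head show up in these lists, the encoder weights were loaded.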

brightmart · Answer 9 · Tue Sep 10 2019 19:35:38 GMT+0800 (China Standard Time)

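@ymcui Yes, the current RoBERTa was trained with max_seq_len=256, so it is suited to inputs within that range. For long texts, e.g. longer than 256, it may perform poorly.

What do the reading comprehension results look like?

@ewrfcas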

Yiming Cui · Answer 10 · Tue Sep 10 2019 19:45:07 GMT+0800 (China Standard Time)

@brightmart
OK, got it. Thanks.

ewrfcas · Answer 11 · Tue Sep 10 2019 21:51:42 GMT+0800 (China Standard Time)

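My CMRC2018 results were all obtained with a sequence length of 512. Across 5 runs, middle's F1 is 86~87, while large's F1 is about 10 points lower, around 75~77. Results for large at length 256 are being tested now.
@brightmart I hope the max_position_embeddings in the large model's config file can be adjusted.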

ewrfcas · Answer 12 · Tue Sep 10 2019 22:49:37 GMT+0800 (China Standard Time)

Current dev results for roberta-large with length 256 on CMRC2018:
F1: 88.365, EM: 69.991
Best at epoch 1 with lr=2e-5.
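For readers unfamiliar with the two metrics, here is a simplified sketch of how EM and a character-level F1 can be computed for Chinese span extraction. It assumes overlap is measured with the longest common substring between prediction and reference, and it skips the punctuation stripping and mixed Chinese/English tokenization a full evaluation script would do, so its numbers will not exactly match the official CMRC2018 scorer. All function names are hypothetical.

```python
def longest_common_substring_len(a, b):
    """Length of the longest contiguous run shared by sequences a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def char_f1(prediction, reference):
    """Character-level F1 (simplified: whitespace removed, nothing else)."""
    pred = [c for c in prediction if not c.isspace()]
    ref = [c for c in reference if not c.isspace()]
    if not pred or not ref:
        return float(pred == ref)
    overlap = longest_common_substring_len(pred, ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction, reference):
    """1.0 if the strings are identical after removing whitespace."""
    return float("".join(prediction.split()) == "".join(reference.split()))


def score(prediction, references):
    """Best EM and best F1 over all reference answers for one question."""
    em = max(exact_match(prediction, r) for r in references)
    f1 = max(char_f1(prediction, r) for r in references)
    return em, f1


em, f1 = score("北京大学", ["北京大学图书馆"])
print(em, round(f1, 3))  # 0.0 0.727
```

A prediction that covers only part of the reference scores 0 on EM but still earns partial F1, which is why F1 can sit well above EM.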

brightmart · Answer 13 · Tue Sep 10 2019 23:36:28 GMT+0800 (China Standard Time)

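So, at first glance, how does it compare with other models on this reading comprehension task? And why can the length be set so small for reading comprehension?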

ewrfcas · Answer 14 · Wed Sep 11 2019 09:15:03 GMT+0800 (China Standard Time)

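This result currently falls between ERNIE2.0 base and ERNIE2.0 large, which is fairly good among pretrained models.
With the length set to 256 it can still run by relying on a sliding window, but performance drops a little.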

brightmart · Answer 15 · Sun Sep 15 2019 22:56:13 GMT+0800 (China Standard Time)

OK. @ewrfcas Could you test and compare XLNet_zh_Large on the CMRC2018 dataset?

(The current XLNet_zh_Large is an early-preview release; we will help resolve any problems.)

ewrfcas · Answer 16 · Mon Sep 16 2019 09:21:52 GMT+0800 (China Standard Time)

@brightmart If XLNet uses sentencepiece, it does not do well on reading comprehension; see ymcui/Chinese-XLNet#11 for details.

oyjxer · Answer 17 · Mon Oct 21 2019 20:59:49 GMT+0800 (China Standard Time)

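This result currently falls between ERNIE2.0 base and ERNIE2.0 large, which is fairly good among pretrained models.
With the length set to 256 it can still run by relying on a sliding window, but performance drops a little.

How exactly is the sliding window done? @ewrfcas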

Huang Zhuangze · Answer 18 · Tue Oct 29 2019 15:01:33 GMT+0800 (China Standard Time)

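This result currently falls between ERNIE2.0 base and ERNIE2.0 large, which is fairly good among pretrained models.
With the length set to 256 it can still run by relying on a sliding window, but performance drops a little.

How exactly is the sliding window done? @ewrfcas

Bookmarking this.. I'm curious too.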

ewrfcas · Answer 19 · Wed Oct 30 2019 14:27:30 GMT+0800 (China Standard Time)

For the sliding window, you can refer to Google's official SQuAD code, or https://github.com/ewrfcas/bert_cn_finetune/blob/master/preprocess/cmrc2018_preprocess.py
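As a rough illustration of the idea, the sketch below splits a long passage into overlapping windows in the style of the BERT SQuAD preprocessing. It works on token counts only. The function names and the default `doc_stride=128` are assumptions for illustration, not values confirmed in this thread.

```python
def sliding_windows(num_doc_tokens, num_query_tokens, max_seq_len=256, doc_stride=128):
    """Split a passage into overlapping (start, length) token windows.

    Each model input is [CLS] query [SEP] window [SEP], so 3 positions
    plus the query length are reserved out of max_seq_len.
    """
    max_doc_len = max_seq_len - num_query_tokens - 3
    if max_doc_len <= 0:
        raise ValueError("query is too long for the given max_seq_len")
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_doc_len)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the passage
        # Advance by doc_stride, so consecutive windows overlap.
        start += min(length, doc_stride)
    return spans


def window_contains_answer(span, answer_start, answer_end):
    """True if the answer tokens [answer_start, answer_end] lie fully inside the window."""
    start, length = span
    return start <= answer_start and answer_end <= start + length - 1


spans = sliding_windows(num_doc_tokens=600, num_query_tokens=13)
print(spans)  # [(0, 240), (128, 240), (256, 240), (384, 216)]
print([window_contains_answer(s, 250, 260) for s in spans])  # [False, True, False, False]
```

During training, windows that do not fully contain the answer are typically labeled as having no answer in that window. At prediction time, each window is scored separately and the best-scoring span across all windows is kept.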