itsnamgyu / reasoning-teacher

Official code for "Large Language Models Are Reasoning Teachers", ACL 2023

Home Page: https://arxiv.org/abs/2212.10071

Diverse reasoning does not work as described, and the provided B_text-davinci-002__C_zs_cot_t70 data is incomplete

purongcheng opened this issue

Why is the B_text-davinci-002__C_zs_cot_t70 data you provide incomplete? Every dataset is missing some samples. For example, D_addsub.json is missing the samples with indices [147, 277, 285, 290, 163, 291, 38, 39, 41, 42, 43, 169, 172, 298, 47, 48, 177, 178, 305, 180, 312, 57, 185, 314, 62, 192, 193, 195, 323, 69, 70, 326, 72, 202, 203, 333, 334, 335, 209, 82, 211, 84, 337, 342, 88, 91, 348, 350, 352, 99, 227, 105, 361, 363, 377, 365, 251, 242, 115, 244, 117, 379, 121, 123, 381, 382]. Also, the diverse reasoning method I used did not produce the results you reported; I don't think diverse reasoning works as well as you say.

I think this is because of our random train-test splits. The missing samples are likely those that were assigned to the test set, so we do not use teacher-generated rationales for them. Please check the paper for details on the data splits.
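
If you want to verify this, here is a minimal sketch of a cross-check. The file paths and JSON layouts (a split file with "train"/"test" index lists, and a completion file keyed by sample index) are assumptions for illustration only and may not match the repo's actual formats:

```python
import json

# Hypothetical paths and layouts for illustration; the actual split and
# completion file formats in this repo may differ.
SPLIT_PATH = "data/splits/addsub.json"  # assumed: {"train": [...], "test": [...]}
COMPLETIONS_PATH = "B_text-davinci-002__C_zs_cot_t70/D_addsub.json"  # assumed: keyed by sample index

with open(SPLIT_PATH) as f:
    split = json.load(f)
with open(COMPLETIONS_PATH) as f:
    completions = json.load(f)

present = {int(k) for k in completions}
missing = sorted(i for i in split["train"] + split["test"] if i not in present)

print("missing indices:", missing)
# If the train-test split explanation holds, every missing index is a test sample.
print("all missing indices are in the test split:", set(missing) <= set(split["test"]))
```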

There are many possible reasons for differing results. I'd be glad to discuss if you can share some details.

Yes, many factors influence the result. With different temperature settings, the generated rationales will differ. So is temperature = 0.7 the best choice?
I find the generated rationales too similar, which hurts the student model's training. How should I set the temperature to make the generated rationales more diverse?

There is a tradeoff between diversity and quality of samples. We selected 0.7 based on a previous paper (self-consistency), but this may not be optimal for distillation, and the optimal value will likely vary with the model and domain.
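
For reference, here is a minimal sketch of sampling multiple rationales per question at a given temperature, using the legacy OpenAI Completions API (openai<1.0). The prompt template, token limit, and number of samples are illustrative assumptions, not the exact settings used in this repo:

```python
import openai

# Illustrative zero-shot CoT prompt; the repo's actual prompt format may differ.
ZS_COT_PROMPT = "Q: {question}\nA: Let's think step by step."

def sample_rationales(question, n=8, temperature=0.7):
    """Draw n rationales from the teacher at the given temperature."""
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=ZS_COT_PROMPT.format(question=question),
        max_tokens=256,
        temperature=temperature,  # higher -> more diverse, but possibly lower-quality rationales
        n=n,                      # number of reasoning paths sampled per question
    )
    return [choice["text"].strip() for choice in response["choices"]]
```

Raising the temperature (or n) increases diversity at the cost of per-sample rationale quality, which is the tradeoff described above.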