For models with other architectures, such as the Qwen family, how do I find the best `\alpha`, `\beta` and `\sqrt{1/t}` parameters?
ki-ljl opened this issue
Junliang Li commented
The author mentioned in the paper that for the Llama family, good values of `\alpha` and `\beta` are 1 and 32, but did not mention how to obtain these two parameters. In addition, the author mentioned that `\sqrt{1/t}` can be fitted by the lowest ppl. Can this part be explained more clearly?
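To make the question concrete, here is a minimal sketch of how I currently understand "fitted by the lowest ppl"; please correct me if this is not the intended procedure. For each context scale factor `s`, sweep candidate values of `\sqrt{1/t}`, keep the one with the lowest perplexity, then fit those best values against `ln(s)`. Everything here is my assumption rather than the paper's code: the helper names, the candidate grid, the log-linear form `a * ln(s) + b`, and especially `toy_eval_ppl`, which is a synthetic stand-in with an arbitrary minimum and would need to be replaced by a real perplexity evaluation of the model.

```python
import math
from typing import Callable, List, Sequence, Tuple


def best_by_perplexity(candidates: Sequence[float],
                       eval_ppl: Callable[[float], float]) -> float:
    """Return the candidate sqrt(1/t) value with the lowest perplexity."""
    return min(candidates, key=eval_ppl)


def fit_log_linear(scales: Sequence[float],
                   best_values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of best_value ~ a * ln(s) + b.

    The log-linear form is an assumption made for this sketch.
    """
    xs = [math.log(s) for s in scales]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(best_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, best_values))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def toy_eval_ppl(scale: float) -> Callable[[float], float]:
    """Hypothetical stand-in for a real perplexity evaluation.

    Returns a function of sqrt(1/t) with a synthetic, arbitrarily chosen
    minimum. These are NOT measurements from any model.
    """
    target = 1.0 + 0.25 * math.log(scale)
    return lambda value: 5.0 + (value - target) ** 2


# Candidate grid from 1.00 to 2.00 in steps of 0.01 (an arbitrary choice).
candidates: List[float] = [1.0 + 0.01 * i for i in range(101)]
scales: List[float] = [2.0, 4.0, 8.0, 16.0]

# Step 1: for every scale factor, pick the value with the lowest perplexity.
best_values = [best_by_perplexity(candidates, toy_eval_ppl(s)) for s in scales]

# Step 2: fit the best values as a function of ln(s).
a, b = fit_log_linear(scales, best_values)
print(f"sqrt(1/t) ~ {a:.3f} * ln(s) + {b:.3f}")
```

If this is roughly right, I assume the same kind of sweep could be used for `\alpha` and `\beta` on a Qwen model, but I do not know whether that is how 1 and 32 were chosen.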
If anyone can answer my question I would appreciate it!