Investigate Expert Models Having High Perplexity
mrcabbage972 opened this issue
Our analysis in #53 has shown that the expert models we had previously trained actually have a higher perplexity than the base model.
Here are some issues that may have caused this:
- no warmup
- LR too high
- too few steps
- mixing in pile data
- too many gradient accumulation steps
- measurement error
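On the first two points, a missing warmup combined with a high peak LR is a plausible way to degrade a pretrained model early in fine-tuning. As a reference for what a sane schedule looks like, here is a minimal sketch of linear warmup followed by linear decay (the function name and the step/LR numbers are illustrative, not taken from our trainer):

```python
def lr_at_step(step, max_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to max_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return max_lr * remaining / max(total_steps - warmup_steps, 1)

# Illustrative values only: 1000 total steps, 100 warmup steps, peak LR 5e-5.
total, warmup, peak = 1000, 100, 5e-5
print(lr_at_step(0, peak, warmup, total))     # 0.0 -- starts from zero
print(lr_at_step(100, peak, warmup, total))   # 5e-05 -- peak at end of warmup
print(lr_at_step(1000, peak, warmup, total))  # 0.0 -- fully decayed
```

With no warmup, the very first updates are taken at the peak LR, which can push a well-initialized model off its pretraining optimum before the loss has a chance to settle.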
The expert models were trained with an old version of the trainer, so we don't know which wandb run they belong to or what the pile/domain data losses were during training. Re-doing the training of one of the experts should help clarify.
Further investigation: train.py has a --do-eval option that also computes perplexity. After running both the base model and the arxiv model through it on the arxiv dataset, I see the same discrepancy as in the dedicated perplexity script. This rules out my concern that the gap was an artifact of a different data/tokenization pipeline in the perplexity script versus the trainer.
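For both measurement paths to be comparable, they only need to agree on the definition: perplexity is the exponential of the mean per-token negative log-likelihood. A minimal sketch of that computation (illustrative only, not the project's actual perplexity script):

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over all scored tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Sanity check: a model that assigns probability 0.5 to every token
# has a perplexity of exactly 2.
logprobs = [math.log(0.5)] * 10
print(perplexity(logprobs))  # 2.0
```

Since both --do-eval and the dedicated script report the same gap, the discrepancy reflects the models' actual token-level likelihoods rather than a difference in how the metric is computed.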