huu4ontocord / MDEL

Multi-Domain Expert Learning


Investigate Expert Models Having High Perplexity

mrcabbage972 opened this issue · comments

Our analysis in #53 has shown that the expert models we had previously trained actually have a higher perplexity than the base model.

Here are some issues that may have caused this (a sketch of how the first few could be addressed follows the list):

  • no warmup
  • LR too high
  • too few steps
  • mixing in pile data
  • too many gradient accumulation steps
  • measurement error
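For the first four points, here is a minimal sketch of what a corrected training configuration could look like, assuming the trainer is built on Hugging Face `TrainingArguments`. All names and values are illustrative guesses, not the repo's actual settings:

```python
from transformers import TrainingArguments

# Illustrative values only -- not the actual MDEL expert training config.
training_args = TrainingArguments(
    output_dir="arxiv-expert-retrain",
    learning_rate=1e-5,              # lower LR than the earlier runs
    warmup_steps=500,                # add an explicit warmup phase
    max_steps=5000,                  # train for more optimizer steps
    gradient_accumulation_steps=4,   # keep effective batch size moderate
    per_device_train_batch_size=4,
    logging_steps=50,
    report_to="wandb",               # make the run traceable later
)
```

Whether values like these actually fix the perplexity gap is exactly what re-training one expert would show.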

The expert models were trained with an old version of the trainer, so we don't know which wandb run they belong to or what the pile/domain losses were during training. Re-doing the training of one of the experts should help clarify this.
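If an expert is re-trained, something like the following wandb bookkeeping would keep the run traceable and log the pile and domain losses separately. Project and run names here are placeholders, not an existing convention in this repo:

```python
import wandb

# Placeholder project/run names; adjust to the team's naming scheme.
run = wandb.init(
    project="mdel-experts",
    name="arxiv-expert-retrain-01",
    tags=["expert", "arxiv"],
    config={"learning_rate": 1e-5, "warmup_steps": 500},
)

# Inside the training loop, log the two loss streams under separate keys
# so they can be compared after the fact (dummy values shown here).
wandb.log({"loss/domain": 2.31, "loss/pile": 2.74}, step=100)

run.finish()
```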

Further investigation: train.py has a --do-eval option that also computes perplexity. After running both the base model and the arxiv model through it on the arxiv dataset, I see the same discrepancy as in the dedicated perplexity script. This rules out my concern that the discrepancy came from a different data/tokenization pipeline in the perplexity script versus the training code.
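As a further cross-check independent of both pipelines, perplexity can be computed directly from the causal LM loss. A minimal sketch (the checkpoint name and texts are placeholders, not the actual eval setup):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in the base model and the arxiv expert in turn.
model_name = "EleutherAI/pythia-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(texts, max_length=1024):
    """Token-weighted perplexity over a list of raw text samples."""
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        with torch.no_grad():
            # Passing labels=input_ids makes the model return the mean
            # cross-entropy over the (shifted) sequence.
            out = model(**enc, labels=enc["input_ids"])
        n_tokens = enc["input_ids"].numel()
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)

print(perplexity(["A sample arXiv abstract would go here."]))
```

Running the same held-out domain texts through both checkpoints with a loop like this gives a pipeline-independent comparison of the two perplexities.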