stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.


arwen is a checkpoint progression outlier?

hawkrobe opened this issue

I had a quick backchannel with @siddk, but was curious if anyone else had noticed that the Arwen seed is an extreme outlier in its checkpoint progression. We've been examining properties of attention matrices across the training trajectory, and noticed that at arwen's first checkpoint (checkpoint-10), its internal state and behavior look almost exactly like the internal states and behavior that the 9 other seeds reach significantly later, around checkpoint-4000. It made us wonder whether the checkpoint labeling scheme might be different for Arwen.

Some (internal) plots are attached as examples, but Arwen shows up as an outlier on every metric we've tried. The most dramatic example for us was the final plot, which shows a rather complex summary statistic computed on attention matrices across layers. It was striking how this highly derived metric shows, at Arwen's first checkpoint, precisely the same profile across layers that the other models only reach much later, and how it stays rather stable for Arwen up to that point, after which it starts changing again.

We've checked carefully for bugs in our own code, and it's possible there's something we're missing, but we're running all the different models through the same pipeline with a fresh pull of the checkpoints, so it does seem to be a property of the checkpoints themselves. We're trying to determine whether the Arwen seed genuinely stumbled across this pattern extremely early on (which seems unlikely, given the learning rates and the relatively small number of observations by that point), or whether something got jumbled up with the labels.

We're extremely grateful for MISTRAL as an incredible resource, and would very much appreciate any advice from others who have played with the checkpoints.

Accuracy on task (pdf)
Aggregated attention matrix statistic (pdf)
Layerwise attention matrix statistic (pdf)
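
For anyone wanting to poke at this themselves, here is a rough sketch of the kind of layerwise attention statistic we mean. The actual metric behind the plots above is more involved; mean attention entropy per layer is used here purely as a stand-in, and the repo id and `checkpoint-N` branch naming are assumptions about how the seeds are hosted on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stanford-crfm/arwen-gpt2-medium-x21"  # hypothetical repo id
REVISION = "checkpoint-10"                        # assumed branch naming

tok = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, revision=REVISION, output_attentions=True
)
model.eval()

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids
with torch.no_grad():
    # One (batch, heads, seq, seq) attention tensor per layer.
    attentions = model(ids).attentions

for layer, attn in enumerate(attentions):
    # Entropy over attended positions, averaged across heads and query positions.
    entropy = -(attn * (attn + 1e-12).log()).sum(dim=-1).mean()
    print(f"layer {layer:2d}: mean attention entropy = {entropy.item():.3f}")
```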

CC @J38 @dlwh @Tiiiger and @lorr1; do y'all remember if other folks who've been doing interpretability work with Mistral checkpoints have run into this before?

commented

I don't see any evidence arwen is different from celebrimbor ... if you look at the loss curves they are very similar ... this seems to suggest there is some kind of labeling issue ...
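
Something like the following is one way to overlay the logged curves, assuming each checkpoint branch ships the Trainer's trainer_state.json; the repo ids and the final branch name are guesses at the Hub layout:

```python
import json
import matplotlib.pyplot as plt
from huggingface_hub import hf_hub_download

def loss_curve(repo_id, revision):
    """(steps, losses) from a HF Trainer-style trainer_state.json."""
    path = hf_hub_download(repo_id, "trainer_state.json", revision=revision)
    with open(path) as f:
        history = json.load(f)["log_history"]
    points = [(e["step"], e["loss"]) for e in history if "loss" in e]
    return zip(*points)

for repo in ["stanford-crfm/arwen-gpt2-medium-x21",
             "stanford-crfm/celebrimbor-gpt2-medium-x81"]:
    steps, losses = loss_curve(repo, revision="checkpoint-400000")  # assumed final branch
    plt.plot(steps, losses, label=repo.split("/")[-1])
plt.xlabel("step"); plt.ylabel("logged train loss"); plt.legend(); plt.show()
```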

commented

We should probably download the step-10 checkpoints for each model run and check the loss on wikitext ...
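
A minimal version of that check might look like this (the repo ids and `checkpoint-N` branch naming are assumptions; non-overlapping windows are good enough for a coarse comparison):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def wikitext_loss(repo_id, revision):
    """Average causal-LM loss on the WikiText-2 test split."""
    tok = AutoTokenizer.from_pretrained(repo_id, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
    model.eval()

    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tok(text, return_tensors="pt").input_ids

    losses = []
    stride = model.config.n_positions  # non-overlapping windows, quick and coarse
    with torch.no_grad():
        for i in range(0, ids.size(1), stride):
            window = ids[:, i : i + stride]
            if window.size(1) < 2:  # need at least one shifted target
                continue
            losses.append(model(window, labels=window).loss.item())
    return sum(losses) / len(losses)

# Hypothetical repo ids for two of the seeds; both at step 10.
for repo in ["stanford-crfm/arwen-gpt2-medium-x21",
             "stanford-crfm/celebrimbor-gpt2-medium-x81"]:
    print(repo, wikitext_loss(repo, revision="checkpoint-10"))
```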

commented

So for whatever reason the arwen checkpoint at step 10 is wrong ... I am not sure where that error occurred ... if you download the arwen and celebrimbor step-10 checkpoints, they have wildly different losses ...

commented

The arwen step-10 checkpoint does not have a loss on wikitext or lambada consistent with the trainer_state logging ... I will spot-check some other checkpoints ...
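
For reference, this is the shape of the comparison (repo id hypothetical; `wikitext_loss` is the sketch from the earlier comment). Train loss and WikiText loss aren't directly comparable, but at step 10 both should still be near random-init level (~ln(50257) ≈ 10.8 nats), so a gross mismatch stands out:

```python
import json
from huggingface_hub import hf_hub_download

def logged_loss(repo_id, revision):
    """Last training loss recorded in the checkpoint's trainer_state.json."""
    path = hf_hub_download(repo_id, "trainer_state.json", revision=revision)
    with open(path) as f:
        history = json.load(f)["log_history"]
    return [e["loss"] for e in history if "loss" in e][-1]

repo = "stanford-crfm/arwen-gpt2-medium-x21"  # hypothetical id, as above
print("logged train loss:     ", logged_loss(repo, "checkpoint-10"))
print("measured wikitext loss:", wikitext_loss(repo, "checkpoint-10"))
```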

commented

At some point all of these checkpoints were stored on Google Cloud (before we deleted them) ... when they were migrated to Hugging Face I did a random sample, comparing checkpoints on HF against the Google Cloud copies, and none of the samples showed a mismatch ...
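
If both copies of a checkpoint are still available somewhere, a byte-level comparison is the cleanest way to rule the migration in or out. A sketch, with placeholder paths for the two local copies:

```python
import hashlib
from pathlib import Path

def sha256(path):
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def file_hashes(root):
    root = Path(root)
    return {p.relative_to(root).as_posix(): sha256(p)
            for p in root.rglob("*") if p.is_file()}

gcs = file_hashes("gcs_copy/checkpoint-10")  # placeholder paths
hub = file_hashes("hf_copy/checkpoint-10")
for name in sorted(set(gcs) | set(hub)):
    if gcs.get(name) != hub.get(name):
        print("MISMATCH:", name)
```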

commented

My basic analysis right now is that something is off with the arwen checkpoints below 3000 (maybe even higher) ... it looks like after 3000 the checkpoints have the expected loss values ... the celebrimbor ones below 3000 seem fine ... hopefully this is isolated to the early checkpoints for arwen ...
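
To localize the boundary, the same eval can be swept over steps on both sides of 3000 for each seed (the step list and repo ids are guesses; `wikitext_loss` as above):

```python
steps = [10, 100, 1000, 2000, 3000, 4000, 10000]
for repo in ["stanford-crfm/arwen-gpt2-medium-x21",
             "stanford-crfm/celebrimbor-gpt2-medium-x81"]:
    for step in steps:
        print(repo, step, wikitext_loss(repo, f"checkpoint-{step}"))
```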

commented

As I said before, I am not sure at what point in the process this issue emerged ... it's possible the original arwen checkpoints were incorrect, or something happened in the copying and uploading to HF process ...

commented

@J38 @dlwh - are the original checkpoints still in the GCP bucket? Can we try finding the originals somewhere? They might also be on the NLP cluster?

commented

@J38 thanks so much for looking into this. It's a relief (on our end) to hear that the deviations from expected loss values pre-3000 are consistent with our observations of other properties pre-3000 (everything else seems to align after 3000).

commented

Glad we're starting to get to the bottom of this. @hawkrobe - sorry that I didn't surface this sooner in the original email thread. Hopefully we still have the originals around, and can rectify this!

commented

They're deleted and I think you did it ... or me ... don't remember ...