stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.


arwen is a checkpoint progression outlier?

hawkrobe opened this issue

I had a quick backchannel with @siddk, but was curious if anyone else had noticed that the Arwen seed is an extreme outlier in its checkpoint progression. We've been examining properties of attention matrices across the training trajectory, and noticed that at arwen's first checkpoint (checkpoint-10), its internal state and behavior look almost exactly like the internal states and behavior that the 9 other seeds reach significantly later, around checkpoint-4000. It made us wonder whether the checkpoint labeling scheme might be different for Arwen.

Some (internal) plots are attached as examples, but Arwen shows up as an outlier on every metric we've tried. The most dramatic example for us was the final plot, which shows a rather complex summary statistic computed on attention matrices across layers. It was striking how this highly derived metric shows, at Arwen's first checkpoint, precisely the same profile across layers that the other models only reach much later, and how it stays rather stable for Arwen up to that point, after which it starts changing again.

We've checked carefully for bugs in our own code, and it's possible there's something we're missing, but we're running all the different models through the same pipeline with a fresh pull of the checkpoints, so it does seem to be a property of the checkpoints themselves. We're trying to determine whether the Arwen seed genuinely stumbled across this pattern extremely early on (which seems unlikely, given the learning rates and the relatively small number of observations by that point), or whether something got jumbled up with the labels.

We're extremely grateful for MISTRAL as an incredible resource, and would very much appreciate any advice from others who have played with the checkpoints.

Accuracy on task (pdf)
Aggregated attention matrix statistic (pdf)
Layerwise attention matrix statistic (pdf)
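
For anyone wanting to poke at this themselves, here is a rough sketch of the kind of layerwise attention statistic we mean. The actual metric behind the plots above is more involved; mean attention entropy per layer is used here purely as a stand-in, and the repo id and `checkpoint-N` branch naming are assumptions about how the seeds are hosted on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stanford-crfm/arwen-gpt2-medium-x21"  # hypothetical repo id
REVISION = "checkpoint-10"                        # assumed branch naming

tok = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, revision=REVISION, output_attentions=True
)
model.eval()

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids
with torch.no_grad():
    # One (batch, heads, seq, seq) attention tensor per layer.
    attentions = model(ids).attentions

for layer, attn in enumerate(attentions):
    # Entropy over attended positions, averaged across heads and query positions.
    entropy = -(attn * (attn + 1e-12).log()).sum(dim=-1).mean()
    print(f"layer {layer:2d}: mean attention entropy = {entropy.item():.3f}")
```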

CC @J38 @dlwh @Tiiiger and @lorr1; do y'all remember if other folks who've been doing interpretability work with Mistral checkpoints have run into this before?

commented

I don't see any evidence arwen is different from celebrimbor ... if you look at the loss curves they are very similar ... this seems to suggest there is some kind of labeling issue ...
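
Something like the following is one way to overlay the logged curves, assuming each checkpoint branch ships the Trainer's trainer_state.json; the repo ids and the final branch name are guesses at the Hub layout:

```python
import json
import matplotlib.pyplot as plt
from huggingface_hub import hf_hub_download

def loss_curve(repo_id, revision):
    """(steps, losses) from a HF Trainer-style trainer_state.json."""
    path = hf_hub_download(repo_id, "trainer_state.json", revision=revision)
    with open(path) as f:
        history = json.load(f)["log_history"]
    points = [(e["step"], e["loss"]) for e in history if "loss" in e]
    return zip(*points)

for repo in ["stanford-crfm/arwen-gpt2-medium-x21",
             "stanford-crfm/celebrimbor-gpt2-medium-x81"]:
    steps, losses = loss_curve(repo, revision="checkpoint-400000")  # assumed final branch
    plt.plot(steps, losses, label=repo.split("/")[-1])
plt.xlabel("step"); plt.ylabel("logged train loss"); plt.legend(); plt.show()
```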

commented

We should probably download the step-10 checkpoints for each model run and check the loss on wikitext ...
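
A minimal version of that check might look like this (the repo ids and `checkpoint-N` branch naming are assumptions; non-overlapping windows are good enough for a coarse comparison):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def wikitext_loss(repo_id, revision):
    """Average causal-LM loss on the WikiText-2 test split."""
    tok = AutoTokenizer.from_pretrained(repo_id, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
    model.eval()

    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tok(text, return_tensors="pt").input_ids

    losses = []
    stride = model.config.n_positions  # non-overlapping windows, quick and coarse
    with torch.no_grad():
        for i in range(0, ids.size(1), stride):
            window = ids[:, i : i + stride]
            if window.size(1) < 2:  # need at least one shifted target
                continue
            losses.append(model(window, labels=window).loss.item())
    return sum(losses) / len(losses)

# Hypothetical repo ids for two of the seeds; both at step 10.
for repo in ["stanford-crfm/arwen-gpt2-medium-x21",
             "stanford-crfm/celebrimbor-gpt2-medium-x81"]:
    print(repo, wikitext_loss(repo, revision="checkpoint-10"))
```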

commented

So for whatever reason the arwen checkpoint at step 10 is wrong ... I am not sure where that error occurred ... if you download the arwen and celebrimbor step-10 checkpoints, they have wildly different losses ...

commented

The arwen step-10 checkpoint does not have a loss on wikitext or lambada consistent with the trainer_state logging ... I will spot-check some other checkpoints ...
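
For reference, this is the shape of the comparison (repo id hypothetical; `wikitext_loss` is the sketch from the earlier comment). Train loss and WikiText loss aren't directly comparable, but at step 10 both should still be near random-init level (~ln(50257) ≈ 10.8 nats), so a gross mismatch stands out:

```python
import json
from huggingface_hub import hf_hub_download

def logged_loss(repo_id, revision):
    """Last training loss recorded in the checkpoint's trainer_state.json."""
    path = hf_hub_download(repo_id, "trainer_state.json", revision=revision)
    with open(path) as f:
        history = json.load(f)["log_history"]
    return [e["loss"] for e in history if "loss" in e][-1]

repo = "stanford-crfm/arwen-gpt2-medium-x21"  # hypothetical id, as above
print("logged train loss:     ", logged_loss(repo, "checkpoint-10"))
print("measured wikitext loss:", wikitext_loss(repo, "checkpoint-10"))
```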

commented

At some point all of these checkpoints were stored on Google Cloud (before we deleted them) ... when they were migrated to Hugging Face I did a random sample, comparing checkpoints on HF against the Google Cloud copies, and none of the samples showed a mismatch ...
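
If both copies of a checkpoint are still available somewhere, a byte-level comparison is the cleanest way to rule the migration in or out. A sketch, with placeholder paths for the two local copies:

```python
import hashlib
from pathlib import Path

def sha256(path):
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def file_hashes(root):
    root = Path(root)
    return {p.relative_to(root).as_posix(): sha256(p)
            for p in root.rglob("*") if p.is_file()}

gcs = file_hashes("gcs_copy/checkpoint-10")  # placeholder paths
hub = file_hashes("hf_copy/checkpoint-10")
for name in sorted(set(gcs) | set(hub)):
    if gcs.get(name) != hub.get(name):
        print("MISMATCH:", name)
```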

commented

My basic analysis right now is that something is off with the arwen checkpoints below 3000 (maybe even higher) ... it looks like after 3000 the checkpoints have the expected loss values ... the celebrimbor ones below 3000 seem fine ... hopefully this is isolated to the early checkpoints for arwen ...
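
To localize the boundary, the same eval can be swept over steps on both sides of 3000 for each seed (the step list and repo ids are guesses; `wikitext_loss` as above):

```python
steps = [10, 100, 1000, 2000, 3000, 4000, 10000]
for repo in ["stanford-crfm/arwen-gpt2-medium-x21",
             "stanford-crfm/celebrimbor-gpt2-medium-x81"]:
    for step in steps:
        print(repo, step, wikitext_loss(repo, f"checkpoint-{step}"))
```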

commented

As I said before, I am not sure at what point in the process this issue emerged ... it's possible the original arwen checkpoints were incorrect, or something happened in the copying and uploading to HF process ...

commented

@J38 @dlwh - are the original checkpoints still in the GCP bucket? Can we try finding the originals somewhere? They might also be on the NLP cluster?

commented

@J38 thanks so much for looking into this. It's a relief (on our end) to hear that the deviations from expected loss values pre-3000 are consistent with our observations of other properties pre-3000 (everything else seems to align after 3000).

commented

Glad we're starting to get to the bottom of this. @hawkrobe - sorry that I didn't surface this sooner in the original email thread. Hopefully we still have the originals around, and can rectify this!

commented

They're deleted and I think you did it ... or me ... don't remember ...