openai / weak-to-strong

Observing eval accuracy considerably lower than reported?

knagrecha opened this issue · comments

Hi, thanks for open-sourcing this code. I'm noticing that my tests with GPT-2 variants show considerably lower eval accuracies than what's reported in the paper & charts. I'm using the command provided in the README. I do not think the eval code itself is incorrect --- testing it with LLaMA shows much higher eval accuracies (as I would expect). But I cannot replicate the GPT-2 results; any pointers on what the issue might be?

As an example:

sciq: GPT-2-medium gets an accuracy of 0.43 (0.5 after I lowered the learning rate). LLaMA-7B ground truth got 0.84. LLaMA-7B transferred got 0.43 (0.81 after I lowered the learning rate).

0.43 is worse than random, so either something is wrong with the ML there or your eval set isn't big enough.

Yeah I figured the eval set size seemed small but assumed that the line in the README would work directly. Might test it out again later with a larger eval size.

10X'd the train/test sizes. new results on sciq with gpt2-med and llama-7b after a quick run.

GPT-2-Med ending acc: 0.661 +/- 0.006694460396477075

LLaMA-7B ending acc (gt): 0.866 +/- 0.015234434679370284

LLaMA-7B ending acc (transfer): 0.704 +/- 0.020414896521902825
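
For reference, here is a minimal sketch of how error bars like these can be computed, assuming the +/- is a standard error of the mean over per-example correctness (I haven't checked that this is exactly what the repo reports):

    # Sketch: eval accuracy with a standard error of the mean.
    # `preds` and `labels` are hypothetical arrays of predictions and targets.
    import numpy as np

    def acc_with_sem(preds, labels):
        correct = (np.asarray(preds) == np.asarray(labels)).astype(float)
        mean = correct.mean()
        sem = correct.std(ddof=1) / np.sqrt(len(correct))
        return mean, sem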

Looks nice! Pretty closely aligned with the Qwen results, with slightly lower transfer efficacy. Hope others will add their OSS model eval results soon too.

Would you consider increasing the n_docs/n_test_docs values in the README command? The current values seem pretty low.

haha yeah, they are low! can update that

things to generally keep in mind:

  • things are somewhat noisy in general, even with a large dataset. results are cleaner when averaging across many seeds (see the sketch after this list). i'm not totally sure why, but i think they're noisier than our internal setup was
  • truncating the dataset to be smaller makes things even noisier
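
a minimal sketch of the kind of seed averaging i mean (the per-seed accuracies below are made up for illustration):

    # Average the final eval accuracy over runs that differ only in random seed.
    # These numbers are placeholders, not real results.
    import numpy as np

    seed_accs = np.array([0.661, 0.674, 0.655, 0.668])
    mean = seed_accs.mean()
    sem = seed_accs.std(ddof=1) / np.sqrt(len(seed_accs))
    print(f"accuracy: {mean:.3f} +/- {sem:.3f} over {len(seed_accs)} seeds")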

Off-topic, but I am curious how you guys are thinking about labeling by a weak supervisor vs. criticism/scoring by a weak supervisor. I guess there is an argument in both directions as to whether labeling or criticism is easier for a weak model.

I guess criticism may introduce even more noise due to hallucinations, but if alignment is framed from the perspective of a “weaker human” supervising a strong model, criticism may intuitively be easier than labeling.

I am seeing the same noise issue on my side. Could it be because of the way the classification head is initialized? The paper says the head is initialized with the embedding weights of the tokens "0" and "1", whereas the code seems to initialize it differently.

I actually tried initializing using unembeddings, and it didn't seem to help, but I didn't test very extensively. my hunch is it's not the issue.

by the way, there is some substantial literature on noisiness of fine-tuning, e.g. https://arxiv.org/pdf/2002.06305.pdf

Will look at this, thanks a lot. It would be nice to know how you initialized with the unembedding weights.

here's the code i had used, i did it sort of hackily

        # NOTE: this has to happen after the rest of the model is initialized
        # GPT-2 ties its input embeddings and unembeddings, so wte doubles as
        # the unembedding matrix
        unemb = self.transformer.wte.weight.data
        assert self.num_labels == 2
        # GPT-2 BPE token ids whose unembedding rows seed the two-way head
        inds = [
            11491, # incorrect
            3376, # correct
        ]
        # copy those two rows into the linear classification head `self.score`
        new_data = unemb[inds, :]
        self.score.weight.data.copy_(new_data)
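
If anyone wants to adapt this, the ids don't need to be hard-coded; here's an illustrative way to look them up (assuming the ids above correspond to the leading-space words " incorrect" / " correct" in the standard GPT-2 BPE vocabulary, which you should verify against your own tokenizer):

    # Illustrative sketch only: look up label-word token ids with the
    # Hugging Face GPT-2 tokenizer instead of hard-coding them.
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    print(tokenizer.encode(" incorrect"))  # id(s) for the "incorrect" label word
    print(tokenizer.encode(" correct"))    # id(s) for the "correct" label word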

Thank you so much @WuTheFWasThat, will test it