FranxYao / Long-Context-Data-Engineering

Implementation of the paper Data Engineering for Scaling Language Models to 128K Context.

It seems the result we get is not the same as what the repo shows

linbeyoung opened this issue

[image: our reproduced needle-in-a-haystack results]

This is the result we get with the code in this repo. We followed the README step by step, making sure the environment, model, and requirements are the same as in the repo, but we are puzzled that we cannot reproduce the same scores, especially at around 4k tokens, where the score is very low. Could you tell us where the problem might be?

Hi, thanks for the interest! I wonder if you could try the following prompt:

```python
test_format = f"<|im_start|> This is a very long story book: <book> {context} </book>.\n Based on the content of the book, Question: {self.retrieval_question}\nAnswer: The best thing to do in San Francisco is"
```

OK, try this branch, which may fix the repeating issue at 4K length:
https://github.com/FranxYao/Long-Context-Data-Engineering/tree/fix_rope

The difference is in the following line:

```python
reset_rope(self.model_to_test, model_max_train_len=81920, scaling_factor=scaling_factor)
```
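
For reference, here is a minimal sketch of what a reset_rope-style helper can look like with the LLaMA rotary-embedding classes in transformers 4.35.x. The actual implementation in the fix_rope branch may differ, and wrappers such as tensor_parallel can change the attribute paths.

```python
# Hypothetical sketch only: the real reset_rope lives in the fix_rope branch
# and may be implemented differently.
from transformers.models.llama.modeling_llama import LlamaLinearScalingRotaryEmbedding

def reset_rope_sketch(model, model_max_train_len=81920, scaling_factor=1.0):
    """Swap every attention layer's rotary embedding for a linearly scaled one,
    so positions beyond the training length are interpolated consistently."""
    for layer in model.model.layers:
        old = layer.self_attn.rotary_emb
        layer.self_attn.rotary_emb = LlamaLinearScalingRotaryEmbedding(
            old.dim,
            max_position_embeddings=model_max_train_len,
            base=old.base,
            device=old.inv_freq.device,
            scaling_factor=scaling_factor,
        )
```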

llama-2-7b-80k-result.zip
This is the output we got yesterday. We are now trying the new branch and the new prompt.

[image] Here is what I get from the new branch.

The model behavior is quite interesting though. If you could confirm you can get similar results I'll merge it to the main branch.

Does it use the original prompt?

Yes, it does (though it would also be interesting to compare the two prompts).

[image: screenshot of our results with the new branch]
We've run the new branch and observed improved results; it is indeed significantly better. However, the overall score is 0.848, which still shows a slight discrepancy from the results reported in your repository. Is there anything we might have overlooked?

One more comment before addressing the problem: I won't close this issue until more people have seen it and verified whether they can replicate my results.

Then, back to the problem: let me first list the related packages here:

```
torch==2.0.0+cu118
transformers==4.35.2
flash-attn==2.3.6
tensor_parallel==2.0.0
```

Could you check whether your torch is a different version?
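
A quick way to compare installed versions against the list above (import names assumed to match the packages; flash-attn imports as flash_attn):

```python
# Print installed versions to compare against the pinned list above.
import torch, transformers, flash_attn, tensor_parallel

for mod in (torch, transformers, flash_attn, tensor_parallel):
    print(mod.__name__, getattr(mod, "__version__", "unknown"))
```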

Also, if you are using the default prompt, maybe try adding "The best thing to do in San Francisco is" after "Answer:"?

Hey @FranxYao, thanks for the great work on this paper. I was wondering: did you use the prompt in the code or the modified prompt above for the figure in the paper?

```python
# prompt in the code
f"<|im_start|> This is a very long story book: <book> {context} </book>.\n Based on the content of the book, Question: {self.retrieval_question}\nAnswer:"

# prompt suggested above
f"<|im_start|> This is a very long story book: <book> {context} </book>.\n Based on the content of the book, Question: {self.retrieval_question}\nAnswer: The best thing to do in San Francisco is"
```

Thanks!

@marcobellagente93 and I have noticed a similar inconsistency with the needle-in-a-haystack evaluation in another project.
We tracked the problem down to the reading of the Paul Graham essays with glob.glob, which is non-deterministic: the files are loaded in arbitrary order, so a different `{context}` is inserted into the prompt, which in turn leads to different model behaviour.
Interestingly, this phenomenon only appears across different clones of the repo; within a single clone the `{context}` seems to be consistent. You can easily verify this by cloning the needle-in-a-haystack repo twice and printing the first 200 characters of the context in the read_context_files function.
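
A minimal sketch of a deterministic loader (the folder name and function body here are illustrative, not the repo's exact code): sorting the glob result fixes the file order, and printing the first 200 characters lets you compare two clones.

```python
import glob

def read_context_files(folder="PaulGrahamEssays", max_chars=200_000):
    """Illustrative loader: sorted() makes the file order deterministic;
    bare glob.glob returns files in an arbitrary, clone-dependent order."""
    context = ""
    for path in sorted(glob.glob(f"{folder}/*.txt")):
        with open(path, "r", encoding="utf-8") as f:
            context += f.read()
        if len(context) >= max_chars:
            break
    print(context[:200])  # quick check that two clones now agree
    return context
```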

For https://arxiv.org/abs/2402.17834 we therefore opted to report the mean and std across 10 runs with random context.
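
A rough sketch of that protocol (run_evaluation here is a hypothetical stand-in for whatever entry point scores a single needle-in-a-haystack run; the shuffling and run count are the only real points):

```python
import glob
import random
import statistics

def evaluate_with_random_context(run_evaluation, folder="PaulGrahamEssays",
                                 n_runs=10, seed=0):
    """Score n_runs evaluations, each with a freshly shuffled essay order,
    and report mean and standard deviation across runs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        files = glob.glob(f"{folder}/*.txt")
        rng.shuffle(files)                  # random context order per run
        scores.append(run_evaluation(files))
    return statistics.mean(scores), statistics.stdev(scores)
```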

Great find! I totally missed that. The file order being different after every clone but consistent within each clone is probably because the files are copied and registered by the file system in arbitrary order.

Thanks for the meticulous study! I suspect this is part of the reason.

Also, I have updated the code and merged the fix_rope branch. The model should give more stable output now.