About the evaluation
LuckyyySTA opened this issue
Hello, awesome work!
I would like to ask about some specific issues I encountered during evaluation. In the ICL setting, although all citations in the demonstrations follow the standard format (citation before the period), the model output occasionally deviates from it, for example:
(1) This is a statement. [1] This is another statement.
(2) This is a statement. [1]. This is another statement.
In the first case, when computing the citation quality metric, the citation is attached to the following sentence, so it actually computes NLI("This is another statement", doc[1]) instead of the intended NLI("This is a statement", doc[1]).
In the second case, it actually computes NLI(".", doc[1]).
I am wondering how you handle these edge cases. Should we standardize the output format before scoring in such situations? Or am I misunderstanding something? Please correct me if I am wrong.
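For reference, one way to standardize outputs like the two cases above before scoring might be a small regex pass. This is only a hypothetical sketch (the helper `normalize_citations` is not part of the released code), assuming citations are bracketed numbers like `[1]`:

```python
import re

def normalize_citations(text: str) -> str:
    """Move citation markers that appear after a sentence-ending period
    back before the period, so NLI pairs the citation with the sentence
    it follows. Hypothetical post-processing, not the benchmark's code."""
    # Case (2): "statement. [1]." -> "statement [1]."
    text = re.sub(r"\.\s*((?:\[\d+\])+)\.", r" \1.", text)
    # Case (1): "statement. [1] next..." -> "statement [1]. next..."
    text = re.sub(r"\.\s*((?:\[\d+\])+)(?=\s|$)", r" \1.", text)
    return text

# Both irregular forms collapse to the standard one:
print(normalize_citations("This is a statement. [1] This is another statement."))
print(normalize_citations("This is a statement. [1]. This is another statement."))
```

Case (2) is handled first so that its trailing period does not get re-matched by the rule for case (1).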
Hi,
This is a good point! We treat those cases as a sign that the model is not strong enough to follow the instructions/format, and it's fair for its scores to be penalized accordingly. That said, I haven't seen this happen often with gpt-3.5/4. Which models are you experimenting with?
Hi,
Thank you for your prompt response! I agree with the points you made. The cases I mentioned came up in recent tests of llama-3-8b on the ELI5 dataset with in-context learning. As you said, the model occasionally struggles to follow the required output format, especially when the base model isn't strong enough.
Thanks again for your insights!