About the evaluation
LuckyyySTA opened this issue
Hello, awesome work!
I would like to ask about some specific issues I encountered during evaluation. In the ICL setting, although all citations in the demonstrations follow the standard format (citation before the period), the model output occasionally deviates from it, for example:
(1) This is a statement. [1] This is another statement.
(2) This is a statement. [1]. This is another statement.
In the first case, when computing the citation quality metric, the citation is attached to the following sentence, so it actually computes NLI("This is another statement", doc[1]) instead of the intended NLI("This is a statement", doc[1]).
In the second case, it actually computes NLI(".", doc[1]).
I am wondering how you handle these edge cases. Should we standardize the output format before scoring in such situations? Or am I misunderstanding something? Please correct me if I am wrong.
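For reference, one way to standardize outputs like the two cases above before scoring might be a small regex pass. This is only a hypothetical sketch (the helper `normalize_citations` is not part of the released code), assuming citations are bracketed numbers like `[1]`:

```python
import re

def normalize_citations(text: str) -> str:
    """Move citation markers that appear after a sentence-ending period
    back before the period, so NLI pairs the citation with the sentence
    it follows. Hypothetical post-processing, not the benchmark's code."""
    # Case (2): "statement. [1]." -> "statement [1]."
    text = re.sub(r"\.\s*((?:\[\d+\])+)\.", r" \1.", text)
    # Case (1): "statement. [1] next..." -> "statement [1]. next..."
    text = re.sub(r"\.\s*((?:\[\d+\])+)(?=\s|$)", r" \1.", text)
    return text

# Both irregular forms collapse to the standard one:
print(normalize_citations("This is a statement. [1] This is another statement."))
print(normalize_citations("This is a statement. [1]. This is another statement."))
```

Case (2) is handled first so that its trailing period does not get re-matched by the rule for case (1).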
Hi,
This is a good point! We treat those cases as a sign that the model is not strong enough to follow the instructions/format, and it's fair for its scores to be penalized accordingly. That said, I haven't seen this happen often with gpt-3.5/4. Which models are you experimenting with?
Hi,
Thank you for your prompt response! I agree with the points you made. The cases I mentioned came up in recent tests of llama-3-8b on the ELI5 dataset with in-context learning. As you said, the model occasionally struggles to follow the required output format, especially when the base model isn't strong enough.
Thanks again for your insights!