princeton-nlp / ALCE

[EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627

About the evaluation

LuckyyySTA opened this issue

commented

Hello, awesome work!

I would like to ask about some specific issues I encountered during evaluation. In the ICL setting, even though every citation in the demonstrations follows the standardized format (citations appear before the period), the model output occasionally uses irregular formats, such as:

(1) This is a statement. [1] This is another statement.
(2) This is a statement. [1]. This is another statement.

In the first case, when the citation quality metric is computed, the evaluator actually runs NLI("This is another statement", doc[1]) rather than the intended NLI("This is a statement", doc[1]).

In the second case, it ends up running NLI(".", doc[1]).
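
To make this concrete, here is a minimal sketch of how per-sentence citation extraction produces these pairings. This is not the actual ALCE evaluation code; the period-based splitter and the `[N]` regex are my own illustrative assumptions:

```python
import re

# Illustrative sketch only -- NOT the actual ALCE evaluator. It assumes a
# naive period-based sentence splitter and a simple [N] citation regex to
# show how a trailing citation gets attached to the wrong sentence.
CITE_RE = re.compile(r"\[(\d+)\]")

def sentence_citation_pairs(answer):
    """Yield (sentence_text_without_citations, cited_doc_ids) per sentence."""
    sentences = [s for s in re.split(r"(?<=\.)\s+", answer) if s.strip()]
    for sent in sentences:
        doc_ids = [int(i) for i in CITE_RE.findall(sent)]
        text = CITE_RE.sub("", sent).strip()
        yield text, doc_ids

# Case (1): the citation after the period attaches to the NEXT sentence.
print(list(sentence_citation_pairs(
    "This is a statement. [1] This is another statement.")))
# [('This is a statement.', []), ('This is another statement.', [1])]

# Case (2): "[1]." becomes its own "sentence", so NLI only sees ".".
print(list(sentence_citation_pairs(
    "This is a statement. [1]. This is another statement.")))
# [('This is a statement.', []), ('.', [1]), ('This is another statement.', [])]
```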

I am wondering how you handle these edge cases. Should the output format be standardized before scoring in such situations? Or am I misunderstanding something? Please correct me if I am wrong.
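
For reference, if one did want to normalize outputs before scoring, a hypothetical post-processing pass (my own sketch, not something the repo provides) could move trailing citations back onto the preceding sentence:

```python
import re

# Hypothetical normalization -- my own sketch, not part of the ALCE repo.
# Moves citations that appear right after a period back before it:
#   "statement. [1] Next..."  -> "statement [1]. Next..."
#   "statement. [1]. Next..." -> "statement [1]. Next..."
TRAILING_CITE = re.compile(r"\.\s*((?:\[\d+\])+)\.?\s*")

def normalize_citations(answer):
    return TRAILING_CITE.sub(lambda m: f" {m.group(1)}. ", answer).strip()

print(normalize_citations("This is a statement. [1] This is another statement."))
# This is a statement [1]. This is another statement.
```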

Hi,

This is a good point! We view those cases as a sign that the model isn't strong enough to follow the instructions/format, and it's fair that its score is penalized accordingly. That said, I don't see this happen often with GPT-3.5/4. What models are you experimenting with?

commented

Hi,

Thank you for your prompt response! I agree with your points. The cases I mentioned came from recent in-context-learning tests on the ELI5 dataset with Llama-3-8B. As you said, the model occasionally struggles to follow the required output format, especially when the base model isn't strong enough.

Thanks again for your insights!