princeton-nlp / ALCE

[EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627

Cannot reproduce experiments with ChatGPT

Jerrrrykun opened this issue · comments

Hi,
I am running experiments with gpt-3.5-turbo-0301 from OpenAI's API. When I reran the setting qampari_turbo_shot2_ndoc10_gtr_summary.yaml on the QAMPARI dataset, ChatGPT produced worse results (for example, it could not follow the instruction "cite one and only one document for each answer" well enough). After running the eval on the inference results, the scores are indeed at least 5-6 points lower (for both correctness and citation quality) than the results reported in your paper, which is outside the expected range even accounting for variance.

Bad case example:

Question: Arsenal F.C. competed in and won which competition?
Gold answer: 2002 FA Cup Final, 2003 FA Cup Final, 2005 FA Cup Final, 1987 Football League Cup Final, 2014 FA Cup Final, 2020 FA Cup Final, 2015 FA Community Shield, 2017 FA Community Shield, 2017 FA Cup Final, 2015 FA Cup Final, 2014 FA Community Shield, 1971 FA Cup Final, 1979 FA Cup Final, 1998 FA Cup Final, 1993 FA Cup Final, 1936 FA Cup Final, 1930 FA Cup Final, 1950 FA Cup Final, 2016 MLS All-Star Game, 1969-70 Inter-Cities Fairs Cup, 1994 European Cup Winners' Cup Final, 1993-94 European Cup Winners' Cup, 1993 Football League Cup Final, 2016-17 FA Cup.
Final model output: UEFA Cup [1], Inter-Cities Fairs Cup [2, 4], European Cup Winners' Cup [3, 4], FA Cup [5, 7, 8],

So could you share more details about your ChatGPT results?

Hi,

I am not sure what caused this problem, but you can find all our ChatGPT output results here:

https://drive.google.com/file/d/1cONFQbvANi-9mBCSBuLveS5cK8zko873/view?usp=sharing

Note that one potential difference is that we are using the Azure version of the OpenAI API, which may cause slight differences. But a 5-6 point gap sounds too big.

Thanks for your reply!

I have partially fixed it (i.e., the correctness is now right, but the citation quality is still much lower) after noticing that we were silently truncating the query based only on the API's response code.

For citation quality, I guess the problem might lie in the ChatGPT version. We are now using gpt-3.5-turbo-0301 from OpenAI, but I noticed that the results you provided were generated with gpt-35-turbo from Azure (azure_api_version = 2022-12-01). So it is quite hard for me to find exactly the same model to completely fix this.

noticing that we were silently truncating the query based only on the API's response code.

Can you elaborate on this? I don't quite get it.

I noticed that the results you provided were generated with gpt-35-turbo from Azure (azure_api_version = 2022-12-01). So it is quite hard for me to find exactly the same model to completely fix this.

We were actually using gpt-3.5-turbo-0301 too, via Azure. One thing is that the Azure API only allows the Completion format, while the OpenAI API only allows the ChatCompletion format (which forces you to use the chat prompt). So it is unclear whether that can affect the performance.

I do remember we used the OpenAI API at the very beginning, but the results should be quite close to what we get now (maybe a 1~2% difference).
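
For reference, the two call formats look roughly like this (a rough sketch with the pre-1.0 openai Python SDK; the deployment name, sampling parameters, and the omitted api_base/api_key setup are placeholders rather than our exact settings):

import openai

prompt = "..."  # the full ALCE prompt: instruction + few-shot demos + documents + question

# Azure endpoint: only the legacy Completion format is available, so the whole
# prompt goes in as plain text.
openai.api_type = "azure"
openai.api_version = "2022-12-01"
azure_resp = openai.Completion.create(
    engine="gpt-35-turbo",  # Azure deployment name (placeholder)
    prompt=prompt,
    max_tokens=300,
    temperature=0.5,
)
azure_text = azure_resp["choices"][0]["text"]

# OpenAI endpoint: only ChatCompletion is available for gpt-3.5-turbo, so the
# same prompt has to be wrapped in chat messages.
openai.api_type = "open_ai"
chat_resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=300,
    temperature=0.5,
)
chat_text = chat_resp["choices"][0]["message"]["content"]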

noticing that we were silently truncating the query based only on the API's response code.

Can you elaborate on this? I don't quite get it.

By this I meant that I misused the error code of the API response to truncate the query (query = query[:4097]), which wrongly truncated some questions.
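
For completeness, I switched to truncating by tokens instead of characters, roughly like the sketch below (tiktoken and the 4097-token context size are my assumptions here, not code from the repo):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def truncate_prompt(prompt, context_size=4097, reserve_for_output=300):
    # Keep the prompt within the model's context window, measured in tokens,
    # leaving some budget for the generated answer.
    budget = context_size - reserve_for_output
    tokens = enc.encode(prompt)
    if len(tokens) <= budget:
        return prompt
    return enc.decode(tokens[:budget])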

I noticed that the results you provided were generated with gpt-35-turbo from Azure (azure_api_version = 2022-12-01). So it is quite hard for me to find exactly the same model to completely fix this.

We were actually using gpt-3.5-turbo-0301 too, via Azure. One thing is that the Azure API only allows the Completion format, while the OpenAI API only allows the ChatCompletion format (which forces you to use the chat prompt). So it is unclear whether that can affect the performance.

Yes, I also saw this in the OpenAI and Azure docs. And I tried many settings: using gpt-3.5-turbo (the default one), with or without a system message like "You are an assistant ...", and adding stop tokens like '\n' and '\n\n', while keeping the other params the same as yours. But the citation-quality results are still not good (the model cannot follow the 'cite only one document for each answer' instruction, leading to quite low recall and precision).
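
Concretely, the variations I tried look roughly like this (a sketch with the pre-1.0 openai SDK; the exact system message and sampling values are placeholders):

import openai

def query_chatgpt(prompt, use_system_message=False, stop=None):
    # Variations tried: with/without a system message, and with stop tokens
    # such as "\n" or "\n\n"; other parameters kept the same as the paper's config.
    messages = []
    if use_system_message:
        messages.append({"role": "system", "content": "You are an assistant ..."})
    messages.append({"role": "user", "content": prompt})
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # also tried gpt-3.5-turbo-0301
        messages=messages,
        stop=stop,              # e.g. ["\n"] or ["\n\n"]
        temperature=0.5,
        max_tokens=300,
    )
    return resp["choices"][0]["message"]["content"]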

In fact, the following scores are my ChatGPT eval results with seed=42:

{
    "citation_rec": 14.853001834802834,
    "citation_prec": 16.089963053960066,
    "prec": 23.166738908621927,
    "rec": 14.31657222240696,
    "rec_top5": 25.160000000000004,
    "f1": 15.29996245222911,
    "f1_top5": 21.813072587519873
}

Reference: your result (from qampari-gpt-35-turbo-gtr_summary-shot2-ndoc10-42-azure.json) with the same seed:

{
    "citation_rec": 22.687014612275277,
    "citation_prec": 25.346391368347597,
    "prec": 20.492103024819578,
    "rec": 13.40283666562549,
    "rec_top5": 23.72,
    "f1": 13.848076827660902,
    "f1_top5": 19.737793138349907
}

Do you have some output examples? I would be surprised if gpt-3.5-turbo-0301 can't even follow the format correctly.

Do you have some output examples? I would be surprised if gpt-3.5-turbo-0301 can't even follow the format correctly.

My ChatGPT result is here. Compared with yours, it seems to have a stronger tendency to generate answers with more than one citation, based on a manual check.
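
To quantify that, I counted the bracketed citation groups containing more than one document id with a quick script like the one below (the result-file layout with a top-level "data" list and an "output" field per example is my assumption about the saved format, and the filename is hypothetical; groups written as adjacent single brackets like [1][2] are not merged here):

import json
import re

with open("qampari-gpt-3.5-turbo-gtr_summary-shot2-ndoc10-42.json") as f:  # hypothetical filename
    results = json.load(f)["data"]

multi, total = 0, 0
for item in results:
    # Each bracketed group, e.g. "[2, 4]" -> "2, 4"
    for group in re.findall(r"\[([^\]]+)\]", item["output"]):
        doc_ids = re.findall(r"\d+", group)
        total += 1
        multi += len(doc_ids) > 1
print(f"{multi}/{total} citation groups contain more than one document")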

Interesting that the model cannot follow the instruction in some cases, but I'm not sure if that's the reason the citation score is low. Can you try the latest gpt-3.5-turbo model instead of the 0301 version and see if it makes a difference?

Another way to debug it is to try gpt-3.5 on the other subsets (ASQA and ELI5) and see if there is a difference.

Interesting that the model cannot follow the instruction in some cases, but I'm not sure if that's the reason the citation score is low. Can you try the latest gpt-3.5-turbo model instead of the 0301 version and see if it makes a difference?

The above result was generated by gpt-3.5-turbo; that is the best I can get. I tried gpt-3.5-turbo-0301, but it had lower performance.

Can you try other datasets / other configs?

Can you try other datasets / other configs?

Yes, I tried other settings (like default and extraction). The 'default' config performs in line with your results, but 'extraction' didn't. BTW, I noticed the following data-loading warning when running the 'summary' and 'extraction' modes: could the lack of summaries be what makes ChatGPT perform worse?

No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 5 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 5 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 5 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 1 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 9 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 2 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 1 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 9 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 8 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 5 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 7 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 1 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 2 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 7 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 1 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 8 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 9 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 5 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 2 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 8 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 9 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 2 documents.
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 31725.52it/s]
Done.

Hi,

The warning message is normal. It means that among the top-100 passages we retrieved, there are fewer than 5 that ChatGPT thinks are 'relevant'. Can you try other datasets too? I am also puzzled by this (one of the problems of using the API lol)...
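
Roughly speaking, the behavior behind that warning looks like the sketch below (an illustration of what the warning implies, not the repo's exact code): for the summary/extraction configs, only retrieved passages that actually carry a ChatGPT-generated summary or extraction are kept, so a question can end up with fewer than ndoc documents.

def usable_docs(docs, ndoc=10, field="summary"):
    # Keep only passages that have a non-empty summary (or extraction);
    # if the retrieved pool runs out, the question gets fewer than ndoc documents.
    kept = [d for d in docs if d.get(field)]
    if len(kept) < ndoc:
        print(
            "No summary found in document. It could be this data do not contain "
            "summary or previous documents are not relevant. "
            f"This question will only have {len(kept)} documents."
        )
    return kept[:ndoc]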