princeton-nlp / ALCE

[EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627

Cannot reproduce experiments with ChatGPT

Jerrrrykun opened this issue · comments

Hi,
I am running experiments with gpt-3.5-turbo-0301 from OpenAI's API. When I reran the setting qampari_turbo_shot2_ndoc10_gtr_summary.yaml on the QAMPARI dataset, ChatGPT produced worse results (for example, it could not follow the instruction "cite one and only one document for each answer" well enough). After running the eval on the inference results, the scores are indeed at least 5-6 points lower (for both correctness and citation quality) than the results reported in your paper, which is outside the expected range even accounting for variance.

Bad case example:

Question: Arsenal F.C. competed in and won which competition?
Gold answer: 2002 FA Cup Final, 2003 FA Cup Final, 2005 FA Cup Final, 1987 Football League Cup Final, 2014 FA Cup Final, 2020 FA Cup Final, 2015 FA Community Shield, 2017 FA Community Shield, 2017 FA Cup Final, 2015 FA Cup Final, 2014 FA Community Shield, 1971 FA Cup Final, 1979 FA Cup Final, 1998 FA Cup Final, 1993 FA Cup Final, 1936 FA Cup Final, 1930 FA Cup Final, 1950 FA Cup Final, 2016 MLS All-Star Game, 1969-70 Inter-Cities Fairs Cup, 1994 European Cup Winners' Cup Final, 1993-94 European Cup Winners' Cup, 1993 Football League Cup Final, 2016-17 FA Cup.
Final model output: UEFA Cup [1], Inter-Cities Fairs Cup [2, 4], European Cup Winners' Cup [3, 4], FA Cup [5, 7, 8],

So could you share more details about your ChatGPT results?

Hi,

I am not sure what caused this problem, but you can find all our ChatGPT output results here:

https://drive.google.com/file/d/1cONFQbvANi-9mBCSBuLveS5cK8zko873/view?usp=sharing

Note that one potential difference is that we are using the Azure version of the OpenAI API, which may cause slight differences. But a 5-6 point gap sounds too big.

Thanks for your reply!

I have partially fixed it (i.e., the correctness is now right, but the citation quality is still much lower) after noticing that we were silently truncating the query based only on the API's response code.

For citation quality, I guess the problem might lie in the ChatGPT version. We are now using gpt-3.5-turbo-0301 from OpenAI, but I noticed that the results you provided were generated with gpt-35-turbo from Azure (azure_api_version = 2022-12-01). So it is quite hard for me to find exactly the same model to completely fix this.

noticing that we were silently truncating the query based only on the API's response code.

Can you elaborate on this? I don't quite get it.

I noticed that the results you provided were generated with gpt-35-turbo from Azure (azure_api_version = 2022-12-01). So it is quite hard for me to find exactly the same model to completely fix this.

We were actually using gpt-3.5-turbo-0301 too, via Azure. One thing is that the Azure API only allows the Completion format, while the OpenAI API only allows the ChatCompletion format (which forces you to use the chat prompt). So it is unclear whether that can affect the performance.

I do remember we used the OpenAI API at the very beginning, but the results should be quite close to what we get now (maybe a 1~2% difference).
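
For reference, the two call formats look roughly like this (a rough sketch with the pre-1.0 openai Python SDK; the deployment name, sampling parameters, and the omitted api_base/api_key setup are placeholders rather than our exact settings):

import openai

prompt = "..."  # the full ALCE prompt: instruction + few-shot demos + documents + question

# Azure endpoint: only the legacy Completion format is available, so the whole
# prompt goes in as plain text.
openai.api_type = "azure"
openai.api_version = "2022-12-01"
azure_resp = openai.Completion.create(
    engine="gpt-35-turbo",  # Azure deployment name (placeholder)
    prompt=prompt,
    max_tokens=300,
    temperature=0.5,
)
azure_text = azure_resp["choices"][0]["text"]

# OpenAI endpoint: only ChatCompletion is available for gpt-3.5-turbo, so the
# same prompt has to be wrapped in chat messages.
openai.api_type = "open_ai"
chat_resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=300,
    temperature=0.5,
)
chat_text = chat_resp["choices"][0]["message"]["content"]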

noticing that we were silently truncating the query based only on the API's response code.

Can you elaborate on this? I don't quite get it.

By this I meant that I misused the error code of the API response to truncate the query (query = query[:4097]), which wrongly truncated some questions.
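
For completeness, I switched to truncating by tokens instead of characters, roughly like the sketch below (tiktoken and the 4097-token context size are my assumptions here, not code from the repo):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def truncate_prompt(prompt, context_size=4097, reserve_for_output=300):
    # Keep the prompt within the model's context window, measured in tokens,
    # leaving some budget for the generated answer.
    budget = context_size - reserve_for_output
    tokens = enc.encode(prompt)
    if len(tokens) <= budget:
        return prompt
    return enc.decode(tokens[:budget])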

I noticed that the results you provided were generated with gpt-35-turbo from Azure (azure_api_version = 2022-12-01). So it is quite hard for me to find exactly the same model to completely fix this.

We were actually using gpt-3.5-turbo-0301 too, via Azure. One thing is that the Azure API only allows the Completion format, while the OpenAI API only allows the ChatCompletion format (which forces you to use the chat prompt). So it is unclear whether that can affect the performance.

Yes, I also saw this in the OpenAI and Azure docs. And I tried many settings: using gpt-3.5-turbo (the default one), with or without a system message like "You are an assistant ...", and adding stop tokens like '\n' and '\n\n', while keeping the other params the same as yours. But the citation-quality results are still not good (the model cannot follow the 'cite only one document for each answer' instruction, leading to quite low recall and precision).
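
Concretely, the variations I tried look roughly like this (a sketch with the pre-1.0 openai SDK; the exact system message and sampling values are placeholders):

import openai

def query_chatgpt(prompt, use_system_message=False, stop=None):
    # Variations tried: with/without a system message, and with stop tokens
    # such as "\n" or "\n\n"; other parameters kept the same as the paper's config.
    messages = []
    if use_system_message:
        messages.append({"role": "system", "content": "You are an assistant ..."})
    messages.append({"role": "user", "content": prompt})
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # also tried gpt-3.5-turbo-0301
        messages=messages,
        stop=stop,              # e.g. ["\n"] or ["\n\n"]
        temperature=0.5,
        max_tokens=300,
    )
    return resp["choices"][0]["message"]["content"]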

In fact, the following scores are my ChatGPT eval results with seed=42:

{
    "citation_rec": 14.853001834802834,
    "citation_prec": 16.089963053960066,
    "prec": 23.166738908621927,
    "rec": 14.31657222240696,
    "rec_top5": 25.160000000000004,
    "f1": 15.29996245222911,
    "f1_top5": 21.813072587519873
}

Reference: your result (from qampari-gpt-35-turbo-gtr_summary-shot2-ndoc10-42-azure.json) with the same seed:

{
    "citation_rec": 22.687014612275277,
    "citation_prec": 25.346391368347597,
    "prec": 20.492103024819578,
    "rec": 13.40283666562549,
    "rec_top5": 23.72,
    "f1": 13.848076827660902,
    "f1_top5": 19.737793138349907
}

Do you have some output examples? I would be surprised if gpt-3.5-turbo-0301 can't even follow the format correctly.

Do you have some output examples? I would be surprised if gpt-3.5-turbo-0301 can't even follow the format correctly.

My ChatGPT result is here. Compared with yours, it seems to have a stronger tendency to generate answers with more than one citation, based on a manual check.
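
To quantify that, I counted the bracketed citation groups containing more than one document id with a quick script like the one below (the result-file layout with a top-level "data" list and an "output" field per example is my assumption about the saved format, and the filename is hypothetical; groups written as adjacent single brackets like [1][2] are not merged here):

import json
import re

with open("qampari-gpt-3.5-turbo-gtr_summary-shot2-ndoc10-42.json") as f:  # hypothetical filename
    results = json.load(f)["data"]

multi, total = 0, 0
for item in results:
    # Each bracketed group, e.g. "[2, 4]" -> "2, 4"
    for group in re.findall(r"\[([^\]]+)\]", item["output"]):
        doc_ids = re.findall(r"\d+", group)
        total += 1
        multi += len(doc_ids) > 1
print(f"{multi}/{total} citation groups contain more than one document")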

Interesting that the model cannot follow the instruction in some cases, but I'm not sure if that's the reason the citation score is low. Can you try the latest gpt-3.5-turbo model instead of the 0301 version and see if it makes a difference?

Another way to debug it is to try gpt-3.5 on the other subsets (ASQA and ELI5) and see if there is a difference.

Interesting that the model cannot follow the instruction in some cases, but I'm not sure if that's the reason the citation score is low. Can you try the latest gpt-3.5-turbo model instead of the 0301 version and see if it makes a difference?

The above result was generated by gpt-3.5-turbo; that is the best I can get. I tried gpt-3.5-turbo-0301, but it had lower performance.

Can you try other datasets / other configs?

Can you try other datasets / other configs?

Yes, I tried other settings (like default and extraction). The 'default' config performs in line with your results, but 'extraction' didn't. BTW, I noticed the following data-loading warning when running the 'summary' and 'extraction' modes: could the lack of summaries be what makes ChatGPT perform worse?

No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 5 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 5 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 5 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 1 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 9 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 2 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 1 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 9 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 8 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 5 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 7 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 1 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 2 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 7 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 1 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 6 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 8 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 9 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 5 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 2 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 8 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 9 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 4 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 3 documents.
No summary found in document. It could be this data do not contain summary or previous documents are not relevant. This is document 20. This question will only have 2 documents.
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 31725.52it/s]
Done.

Hi,

The warning message is normal. It means that among the top-100 passages we retrieved, there are fewer than 5 that ChatGPT thinks are 'relevant'. Can you try other datasets too? I am also puzzled by this (one of the problems of using the API lol)...
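
Roughly speaking, the behavior behind that warning looks like the sketch below (an illustration of what the warning implies, not the repo's exact code): for the summary/extraction configs, only retrieved passages that actually carry a ChatGPT-generated summary or extraction are kept, so a question can end up with fewer than ndoc documents.

def usable_docs(docs, ndoc=10, field="summary"):
    # Keep only passages that have a non-empty summary (or extraction);
    # if the retrieved pool runs out, the question gets fewer than ndoc documents.
    kept = [d for d in docs if d.get(field)]
    if len(kept) < ndoc:
        print(
            "No summary found in document. It could be this data do not contain "
            "summary or previous documents are not relevant. "
            f"This question will only have {len(kept)} documents."
        )
    return kept[:ndoc]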