robertvacareanu / llm4regression

Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter update

Improper baseline

gitcnd opened this issue · comments

GPT4 beats everything with standard python libs.

You - the human doing these tests - should have written that code yourself first, and placed it in the rankings.

I realize it's going to ruin your paper, because plain-old-python regression algorithms will come first in every category, but, that's the point of science.

You need to put some genuine effort into boring procedural solutions to all your problems, and add those to your table. I'm not sure if you're even aware of it yourself, but this whole repo is a classic example of p-hacking. Some of the "answers" you're crediting the AIs with, when you look at how they solved it, show the model had no idea, so it guessed (ran python code) and computed the "average", which you then give it top scores for coming up with.

I realize this will sound rude, and it's totally not my intention to be so here (I'm genuinely trying to help), but you really need to take this work to a statistician or math professor and get them to explain what you're not understanding about why this work is p-hacking. Your career will thank you!

@gitcnd, I do not think you are being rude. I appreciate your concerns and welcome this opportunity to clarify some misunderstandings highlighted in your message. Please also refer to FAQ.md.

Regarding: "GPT4 beats everything with standard python libs"

I would like to emphasize that in the experiments presented, GPT-4 did not have access to standard Python libraries. GPT-4 and all other LLMs were used through the API solely for text completion: they did not write any code, nor could they run any code. You can run it yourself, for example, using the following Google Colab GPT-4 Small Eval, for which you will need an OpenAI API key. There, you will also be able to look at the complete generations of the LLMs. The prediction of the LLM is obtained in a way similar to this:

prediction = float(llm(prompt))

Please refer to src/regressors to see the exact code (OpenAI code here).
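To make the setup concrete, here is a minimal sketch of what "using an LLM only for text completion" means for one of these regression tasks. The prompt template and the `complete` helper below are illustrative placeholders (the exact code is in src/regressors); the only thing the model returns is a string, which is then parsed as a number.

```python
# Illustrative sketch only. `complete` stands in for whatever text-completion
# endpoint is used: it takes a prompt string and returns the model's raw text
# continuation. The model never writes or executes any code.

def build_prompt(train_x, train_y, query_x):
    """Serialize the (input, output) examples as plain text, then append the query."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in zip(train_x, train_y)]
    lines.append(f"Input: {query_x}\nOutput:")
    return "\n\n".join(lines)

def llm_predict(complete, train_x, train_y, query_x):
    prompt = build_prompt(train_x, train_y, query_x)
    completion = complete(prompt)        # plain text in, plain text out
    return float(completion.strip())     # fails loudly if the output is not a number
```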

Moreover, the results presented cover a broad range of LLMs. Many other LLMs, such as Claude 3, DBRX, Mixtral 8x22B, etc., show good performance. You can try it yourself using DBRX (or Mixtral 8x22B) locally, for example.

To elaborate on why you might have observed GPT-4 generating Python code and running it in your local experiments: it is either (i) because you used GPT-4 on https://chat.openai.com/ with the Code Interpreter plugin enabled (please see https://openai.com/blog/chatgpt-plugins), or (ii) because you used the Assistants API (link). Neither was used in the experiments presented here. I solely used the LLMs for text completion; they did not write any code and could not run any code.

Regarding "plain-old-python regression algorithms will come first in every category"

In the experiments presented in the paper, we compare against many traditional regression algorithms (i.e., the ones you refer to as "plain-old-python regression algorithms"), including (i) Linear Regression, (ii) Multi-Layer Perceptron, (iii) Gradient Boosting, and (iv) Random Forest. These methods are widely used to tackle various regression tasks and are battle-tested. Nevertheless, I would be happy to run additional experiments.
I would like to emphasize, once again, that the message of the paper was not intended to be that LLMs should be used in place of traditional supervised methods, just that, without any additional training and with only in-context examples, they perform surprisingly well.

Please refer to the paper or to the heatmap available in README.md (https://github.com/robertvacareanu/llm4regression/blob/main/heatmap_all.png) to see comparisons between the LLMs and many traditional supervised methods, such as Gradient Boosting or Random Forest.

To reiterate, the LLMs were compared against many widely used traditional supervised methods available in sklearn. For a complete list of the methods used, please refer to the paper or to the repository.
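As a concrete illustration of what these baselines look like in code, here is a minimal sketch that fits several standard sklearn regressors on the same small set of (input, output) examples the LLM sees in its prompt. The hyperparameters are defaults chosen for illustration, not the exact configurations used in the repository.

```python
# Illustrative sketch: traditional supervised baselines fit on the same
# (input, output) examples that are serialized into the LLM's prompt.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

def baseline_predictions(train_x, train_y, query_x):
    X_train = np.asarray(train_x, dtype=float).reshape(-1, 1)
    X_query = np.asarray(query_x, dtype=float).reshape(-1, 1)
    y_train = np.asarray(train_y, dtype=float)
    baselines = {
        "Linear Regression": LinearRegression(),
        "MLP": MLPRegressor(max_iter=2000),
        "Gradient Boosting": GradientBoostingRegressor(),
        "Random Forest": RandomForestRegressor(),
    }
    return {name: model.fit(X_train, y_train).predict(X_query)
            for name, model in baselines.items()}
```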

@gitcnd I will close this issue.
For further understanding of how LLMs work and their applications as agents, I recommend reviewing the citations in our paper or recent surveys. The paper on the use of tools by LLMs (link) may also be insightful.

I recommend you take a course in statistics and learn why you're supposed to run manual/human/python tests first (preferably many times), before moving on to the LLMs (one time only).

That will of course be disappointing, and ruin your chances of PR, because nobody wants to read that LLMs are not as good as contemporary methods.

It's still worth it though, because it saves you drawing wrong conclusions and spreading those around. Or (worse for you) wasting your life on something that you didn't realize was not real.

If you don't understand the order (LLMs last) or the timing (once), you might need to re-do your statistics course: it's subtle, and absolutely ruinous to get that part wrong, but most scientists still do. Most scientists also never publish all their mistakes and failed attempts; in short, they "cherry pick" (a la p-hacking) their way to results, having no clue whatsoever that they've turned out meaningless work as a result!

Statistics is everything.

@gitcnd I think there is still some confusion. No LLM had access to any interpreter. The LLMs were used strictly for text completion, in a manner similar to:

prediction = float(llm(prompt))

All the models (e.g., GPT-4, Random Forests, etc.) were tested on the exact same data, independently.
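To illustrate "exact same data" concretely, here is a minimal sketch of the protocol, reusing the hypothetical `complete`, `llm_predict`, and `baseline_predictions` helpers from the sketches earlier in this thread: every model, LLM or traditional, receives identical training examples and is scored on the same held-out query point.

```python
# Illustrative sketch of the shared protocol on one synthetic task:
# identical training examples and query point for every model.
import numpy as np

rng = np.random.default_rng(0)
train_x = rng.uniform(-1, 1, size=50)
train_y = 3.0 * train_x + rng.normal(scale=0.1, size=50)   # noisy linear task
query_x, query_y = 0.5, 1.5

llm_pred = llm_predict(complete, train_x, train_y, query_x)      # via text prompt
sk_preds = baseline_predictions(train_x, train_y, [query_x])     # via .fit/.predict

print("LLM absolute error:", abs(llm_pred - query_y))
for name, pred in sk_preds.items():
    print(f"{name} absolute error:", abs(pred[0] - query_y))
```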

The code is public. If you think there is something wrong, you are more than welcome to run it yourself (or re-implement it). There is also a Google Colab GPT-4 Small Eval to serve as an example. If you want to try it on other datasets, there is a dedicated markdown file here explaining how to do this.

Also, see FAQ.md:

the message of the paper is not that LLMs are better than all traditional supervised methods (e.g., Gradient Boosting) and that they should be used from now on. Instead, it highlights the surprisingly powerful in-context learning capabilities of pre-trained LLMs like GPT-4 and Claude 3. That is, despite no parameter update, LLMs can (sometimes) outperform methods like the ones aforementioned, at least in the small-dataset regime (we tested with at most 500 examples, as per Appendix O).

Thank you!

I suggest you talk to one of your professors and ask them to explain your error to you: as a "soon to graduate PhD" student, you appear to have a problem with your understanding of scientific test methods. This is not an insult. This is serious career advice: you need to get these kinds of tests right, and fully understand the way they should be run, or else most of the rest of your life in this field is going to be wrong.

Do not reply. Arguing isn't going to help you figure out the subtle but critically important issues you need to work through to run tests like these.

@gitcnd, I appreciate the points you have raised, but they are due to a misunderstanding on your part. Please consult the references in the paper, recent surveys, or the paper on the use of tools by LLMs (link).

Also, as mentioned in the previous message, the code is public. If you think there is something wrong, you are more than welcome to run it yourself (or re-implement it). There is also a Google Colab GPT-4 Small Eval to serve as an example. If you want to try it on other datasets, there is a dedicated markdown file here explaining how to do this.