SeungyounShin / Llama2-Code-Interpreter

Make Llama2 use Code Execution, Debug, Save Code, Reuse it, Access to Internet


How to generate code-trajectory data with GPT4?

SeungyounShin opened this issue · comments

Creation of SFT data for the code interpreter, in roughly this trajectory format:

User :
Assistant :
    <Thinking, GPT4>
    <Debug...>
    ...

How can this process be automated with GPT-4 so that such data can be collected efficiently?
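For concreteness, a single trajectory could be serialized roughly as in the sketch below; the field names are only an illustration, not a fixed schema.

```python
# A hedged sketch of one possible JSON layout for a single code-interpreter
# trajectory. Field names are illustrative only, not a fixed schema.
import json

example_trajectory = {
    "user": "Plot the first 10 Fibonacci numbers.",
    "assistant": [
        {"type": "thought",
         "content": "Generate the numbers, then plot them with matplotlib."},
        {"type": "code",
         "content": "import matplotlib.pyplot as plt\n"
                    "fib = [0, 1]\n"
                    "for _ in range(8):\n"
                    "    fib.append(fib[-1] + fib[-2])\n"
                    "plt.plot(fib)\n"
                    "plt.savefig('fib.png')"},
        {"type": "execution_result", "content": "fib.png written, no errors"},
        {"type": "final_answer",
         "content": "Here is the plot of the first 10 Fibonacci numbers (fib.png)."},
    ],
}

print(json.dumps(example_trajectory, indent=2))
```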

@SeungyounShin Hi, I am currently working on a very similar project, mainly generating a dataset for tool use. One of the datasets I am working on involves using a code interpreter tool. My method was basically to start with a few dozen instructions and ask GPT-4 to generate more similar instructions. Using this slightly larger instruction set, I then used the Evol-Instruct [1] method to generate more instructions. So far I have only 4,628 instructions about using the code interpreter.

[1] WizardLM: Empowering Large Language Models to Follow Complex Instructions
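Roughly, that seed → augment → evolve pipeline could look like the sketch below. It assumes the OpenAI Python SDK (v1+) with an `OPENAI_API_KEY` in the environment; the prompts, helper names, and loop structure are illustrative, not the exact setup used here.

```python
# Minimal sketch of: seed instructions -> GPT-4 augmentation -> Evol-Instruct-style
# depth mutation. Prompts and loop sizes are arbitrary examples.
import random
from openai import OpenAI

client = OpenAI()

def ask_gpt4(prompt: str) -> str:
    """One-shot GPT-4 call returning plain text."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

SEEDS = [
    "Plot AAPL's closing price for the last 30 days.",
    "Find all prime numbers below 1000 and report how many there are.",
]

def augment(instruction: str) -> str:
    """Ask for one new instruction similar in spirit to an existing one."""
    return ask_gpt4(
        "Write one new, self-contained task that a code interpreter could solve, "
        f"similar in spirit to:\n{instruction}\nReturn only the new task."
    )

def evolve(instruction: str) -> str:
    """Evol-Instruct-style depth mutation: make the task harder."""
    return ask_gpt4(
        "Rewrite the following task so it is more difficult but still solvable "
        f"with Python code:\n{instruction}\nReturn only the rewritten task."
    )

if __name__ == "__main__":
    pool = list(SEEDS)
    for _ in range(10):                      # grow the pool a little
        pool.append(augment(random.choice(pool)))
    evolved = [evolve(i) for i in pool]      # then harden each instruction
    for task in evolved:
        print(task)
```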

[Image: TSLA_90days — the plot generated by GPT-4 for the Tesla task described below]

Here's an output of the code generated by GPT-4 from my repository. The task was: "Can you plot the Tesla's 90-day volume with the mean of the closing price and a marker at 't' where the mean until 't-1' plus the standard deviation until 't-1' is less than the price at 't'?" The performance of GPT-4 is impressive but the data collection process tends to be slow. This is primarily because it operates in an iterative manner: generating code, executing it, then debugging and modifying the code, and repeating the process. This can lead to considerable latency.
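That iterative loop looks roughly like the sketch below. It reuses the `ask_gpt4` helper from the earlier sketch; the retry limit and prompt wording are placeholders, not the repository's actual code.

```python
# Rough sketch of the generate -> execute -> debug loop described above.
# `ask_gpt4` is the helper from the earlier sketch; everything else is illustrative.
import subprocess
import sys
import tempfile

def run_code(code: str, timeout: int = 60) -> tuple[bool, str]:
    """Execute a code snippet in a fresh interpreter and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        ok = proc.returncode == 0
        return ok, proc.stdout if ok else proc.stderr
    except subprocess.TimeoutExpired:
        return False, f"Timed out after {timeout}s"

def collect_trajectory(task: str, max_rounds: int = 5) -> list[dict]:
    """Run the generate/execute/debug loop and keep every step as SFT data."""
    steps = []
    prompt = f"Write Python code to solve this task:\n{task}\nReturn only code."
    for _ in range(max_rounds):
        code = ask_gpt4(prompt)
        ok, output = run_code(code)
        steps.append({"code": code, "ok": ok, "output": output})
        if ok:
            break
        # Feed the error back so the next round becomes a debugging step.
        prompt = (
            f"The task was:\n{task}\n\nYour code:\n{code}\n\n"
            f"It failed with:\n{output}\n\nFix the code. Return only code."
        )
    return steps
```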

Your method is a valuable alternative, but I believe the real-time execution of code between GPT-4 calls is critical for this task. I've also encountered a second challenge (#2): GPT-4 is effective at debugging but often struggles with generating the final answer. I'm not entirely sure why this happens. I would appreciate any thoughts or suggestions on how to improve this process. Thank you so much! @theblackcat102

I would greatly appreciate any further discussion on this topic. Please feel free to share your insights or suggestions.

@SeungyounShin Oh, I had a code execution module as well; just the initial questions are generated via augmentation. Each round typically took me 20-120 seconds, depending on complexity. My progress usually slows down due to a bad for loop or to training a 500M Hugging Face model on my Mac.
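One simple guard against that kind of runaway generated code is to cap CPU time and memory in the child process. A minimal, Unix-only sketch (the limits are arbitrary examples):

```python
# Cap CPU time and address space of executed code so infinite loops or
# surprise model training can't stall data collection. Unix-only (`resource`).
import resource
import subprocess
import sys

def limited_run(path: str, cpu_seconds: int = 120, mem_bytes: int = 2 * 1024**3):
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=cpu_seconds + 30,   # wall-clock backstop on top of the CPU limit
        preexec_fn=set_limits,
    )
```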

What's the exact issue with #2? Could you provide more insight into the weird-answer problem? An example would be nice 😊

@theblackcat102

I recently explored the concept of Evol-Instruct and found it quite fascinating. Inspired by it, I crafted my own version. In the process, I observed that a significant number of human-engineered prompts are required. I also noticed that GPT-4 often responds with instructions like "Write ~" asking for a Python function, but does not actively check the result or implement it itself; it then appears to congratulate itself on completing the task.

One thing that stood out to me was that Evol-Instruct seems to perform better than Self-Instruct: it produces not only higher-quality prompts but also a more diverse range of them. Generating high-quality prompts is comparatively simple (for instance, we can just request "a more difficult one"), but generating diverse prompts is quite challenging: transitioning from one topic to another can lead to significant deviations, such as moving from a simple '1+1=?' to a complex 'Use CAD to...'.

Considering these observations, it seems that maintaining a balance between diversity and quality could be an interesting research topic.
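One purely illustrative way to lean toward that balance is to evolve for difficulty while rejecting candidates that are too close to prompts already in the pool, e.g. with a simple word-overlap check (the threshold and examples below are arbitrary):

```python
# Accept an evolved prompt only if it is not too similar (word-overlap Jaccard)
# to anything already in the pool, so depth-wise evolution doesn't collapse diversity.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def accept(candidate: str, pool: list[str], max_sim: float = 0.7) -> bool:
    """Reject near-duplicates of existing instructions."""
    return all(jaccard(candidate, existing) < max_sim for existing in pool)

pool: list[str] = ["Plot AAPL's closing price for the last 30 days."]
candidates = [
    "Plot AAPL's closing price for the last 60 days.",            # near-duplicate
    "Train a decision tree on the iris dataset and report accuracy.",
]
for c in candidates:
    if accept(c, pool):
        pool.append(c)
print(pool)
```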

[Still in progress]

How can we enhance the generation of trajectories (code generation, execution, and debugging from the results)?