How do I get only the CoT related data?

Question

timothylimyl opened this issue a year ago

In the FLAN papers, it was mentioned that using CoT data during instruction tuning improves model performance on unseen tasks.

I am planning to curate a self-instruct dataset together with some CoT data. How can I configure the repo to get only the CoT data? Can you also point me to the CoT datasets that produced the extra improvement in performance mentioned in the FLAN papers?

I am also thinking of using self-instruct to generate some CoT data, which seems to be missing from the FLAN methods. Any ideas or plans in this direction?

I was playing around with run_example.py, and these are the kinds of input and output data I am looking for. I like that FLAN-style instruction tuning does not need a base prompt template, compared to needing a rigid base prompt such as in .

Example 1:
input

Antonio is preparing a meal of spaghetti and meatballs for his family. His recipe for meatballs calls for 1/8 of a pound of hamburger per meatball. Antonio has 8 family members, including himself. If he uses 4 pounds of hamburger to make meatballs, and each member of the family eats an equal number of meatballs, how many meatballs will Antonio eat? Let's be accurate as possible.

output

If one meatball is made from 1 / 8 pound of hamburger meat, then 4 pounds of hamburger meat will make 4 / (1 / 8) = 4 * 8 = 32 meatballs. 32 meatballs divided amongst 8 family members is 32 / 8 = 4 meatballs per family member.

The answer: 4.
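The arithmetic in this output can be checked directly. A minimal sketch using exact fractions; the variable names are illustrative and do not come from the dataset:

```python
from fractions import Fraction

# Quantities taken from the Example 1 input.
hamburger_pounds = Fraction(4)
pounds_per_meatball = Fraction(1, 8)
family_members = 8

# Step 1: total meatballs = 4 / (1/8) = 32.
total_meatballs = hamburger_pounds / pounds_per_meatball

# Step 2: meatballs per person = 32 / 8 = 4.
meatballs_per_person = total_meatballs / family_members

print(total_meatballs, meatballs_per_person)  # prints "32 4"
```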

Example 2:
input

Given the sentence "A man playing the guitar on an elevated stage in front an audience." is it true that "A man is about to play the final song in his set."? A step-by-step solution is:

output

Just because he is playing guitar doesn't mean he is playing the final song in his set.

The final answer: it is not possible to tell.

Example 3:
input

Given the sentence "A toddler is riding a plastic scooter." is it true that "A toddler is taking a bath."? Step by step answer:

output

A toddler cannot be taking a bath and riding a scooter at the same time. Therefore, the final answer is no.
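When curating data like this, it can help to separate the rationale from the final answer. The sketch below is a hypothetical helper, not part of the FLAN repo; its regular expressions are written only to match the three answer phrasings seen in the examples above ("The answer: ...", "The final answer: ...", "Therefore, the final answer is ...") and would likely need extending for other templates.

```python
import re

# Hypothetical patterns covering only the phrasings in the three examples.
_ANSWER_PATTERNS = [
    re.compile(
        r"(?:therefore,?\s+)?(?:the\s+)?final answer is\s+(.+?)\.?\s*$",
        re.IGNORECASE,
    ),
    re.compile(
        r"(?:the\s+)?(?:final\s+)?answer:\s*(.+?)\.?\s*$",
        re.IGNORECASE,
    ),
]


def split_cot_target(target: str):
    """Return (rationale, answer); answer is None if no pattern matches."""
    text = target.strip()
    for pattern in _ANSWER_PATTERNS:
        match = pattern.search(text)
        if match:
            # Everything before the answer phrase is treated as the rationale.
            rationale = text[: match.start()].strip()
            return rationale, match.group(1).strip()
    return text, None


rationale, answer = split_cot_target(
    "A toddler cannot be taking a bath and riding a scooter at the same time. "
    "Therefore, the final answer is no."
)
print(answer)  # prints "no"
```

Returning `None` for unmatched targets, rather than guessing, makes it easy to spot examples whose answer format the patterns do not yet cover.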

Lastly, do you think CoT data improves performance because it gives the model extra reasoning capabilities that generalize to tasks requiring a few steps of reasoning?

shayne-longpre · Answer 1 · Sat Apr 15 2023 23:42:37 GMT+0800 (China Standard Time)

@timothylimyl Thank you for your interest.

The cot_submix in run_example.py is what you're looking for (or cot_zsopt if you just want the zero-shot examples).
We don't currently have plans to use self-instruct for CoT data.
Training with CoT data had a greater effect on the explanation generation of larger models (Flan-PaLM); it's sort of an "emergent ability". I believe most of these CoT datasets have 1-2 steps of reasoning, so generalizing to more steps might require data with more steps; otherwise the model might compress the reasoning into the 1-2 steps it has seen.
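For readers who want to see roughly what selecting one of these names looks like, here is a minimal sketch. It assumes that `cot_submix` and `cot_zsopt` are registered with seqio when `flan.v2.mixtures` is imported (an assumption based on how run_example.py is laid out, so check the import path against your checkout). The helper names and the sequence lengths are placeholders, not values from the repo.

```python
def cot_mixture_name(zero_shot_only: bool = False) -> str:
    """Pick between the two names mentioned above."""
    return "cot_zsopt" if zero_shot_only else "cot_submix"


def load_cot_dataset(zero_shot_only: bool = False,
                     input_len: int = 2048,
                     target_len: int = 512):
    """Hypothetical loader; needs seqio and the FLAN repo installed."""
    # Imported lazily so cot_mixture_name works without these packages.
    import seqio
    from flan.v2 import mixtures  # noqa: F401  (assumed to register the mixtures)

    mixture = seqio.get_mixture_or_task(cot_mixture_name(zero_shot_only))
    # Sequence lengths are arbitrary placeholders for this sketch.
    return mixture.get_dataset(
        sequence_length={"inputs": input_len, "targets": target_len},
        split="train",
        shuffle=True,
        num_epochs=1,
    )


print(cot_mixture_name(zero_shot_only=True))  # prints "cot_zsopt"
```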

Best of luck!