maitrix-org / llm-reasoners

A library for advanced large language model reasoning

Home Page: https://www.llm-reasoners.net/

Llama 2 compatibility

chris-aeviator opened this issue

Given the restricted access to Llama 1, is it possible to use the llm-reasoners examples with Llama 2 out of the box?

Hi @chris-aeviator, thanks for the suggestion! We are working on supporting Llama-2, and hopefully, it'll be ready within one week.

Btw, before we integrate the official implementation of Llama-2, you can also use Llama-2 with our llama.cpp interface, as @hilarious-viking suggested below.

Now running gsm8k using llama.cpp with wizardlm-13b-v1.2.ggmlv3.q8_0 (a WizardLM fine-tune of Llama 2):

Case #1: correct=True, output='18', answer='18' ; accuracy=1.000 (1/1)
...
Case #115: correct=False, output='50', answer='6' ; accuracy=0.513 (59/115)
Case #116: correct=False, output='80', answer='90' ; accuracy=0.509 (59/116)
Case #117: correct=False, output='221.2', answer='42' ; accuracy=0.504 (59/117)
Case #118: correct=True, output='360', answer='360' ; accuracy=0.508 (60/118)
Case #119: correct=False, output='8', answer='4' ; accuracy=0.504 (60/119)
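The running accuracy shown in these log lines can be recomputed with a small helper (a sketch of the bookkeeping only; the actual evaluation script in llm-reasoners may differ):

```python
def running_accuracy(results):
    """results: iterable of (output, answer) pairs.

    Yields (case_number, correct, accuracy_so_far), mirroring log lines
    like "Case #115: ... accuracy=0.513 (59/115)".
    """
    correct_total = 0
    for i, (output, answer) in enumerate(results, start=1):
        correct = output == answer
        correct_total += correct  # bool counts as 0/1
        yield i, correct, correct_total / i

cases = [("18", "18"), ("50", "6"), ("80", "90"), ("360", "360")]
for i, correct, acc in running_accuracy(cases):
    print(f"Case #{i}: correct={correct} ; accuracy={acc:.3f}")
```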

That places the results somewhere around LLaMA 65B / Text-davinci-002 (175B), which is impressive considering that Llama 2 70B gets 56.8 on GSM8K (8-shot). It would be interesting to see results for Llama 2 70B with this method...

A few questions:

  • Why is a temperature of 0.8 used? For reproducible results, wouldn't 0.0 be better?
  • To solve math questions, RAP was modified in terms of states, actions, and rewards. If that's the case, is it possible to generalize RAP? One RAP to rule them all, or is that a topic for the next paper? :)
  • I see that evaluation on gsm8k is pretty slow; I suppose that's because of RAP-Aggregate?
  • Any plans to provide a LangChain integration? It would be nice to see an agent of this type there.

Thank you !

Hi @hilarious-viking, thank you for sharing the results!

  • Our method needs to randomly sample multiple different subquestions/subanswers so that we can search for the optimal reasoning path. With a temperature of 0.0, all the generated content would be identical, and we would have no candidates to choose from.
  • Great question! I think there is a trade-off: the most general formulation of a world model is to simply append the new action to the state, $s_t = [s_{t-1}; a_t]$, and we can still use advanced search algorithms and achieve non-trivial performance, as shown by the other examples (1, 2). However, to fully take advantage of RAP, it's preferable to formulate problems in a specific way; e.g., for Blocksworld, defining the state as the configuration of the blocks is much more beneficial.
  • It's because the reward calculation involves sampling multiple answers and taking a majority vote. We find it to be a pretty reliable reward, but it comes at a cost. This is unrelated to RAP-Aggregate. We haven't included the script for aggregation, but it's only a simple traversal of the saved tree and won't take much time.
  • Thanks for your advice, but for now I am not sure how we could integrate LangChain or what LLM Reasoners would gain from it. Could you be more specific?
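The temperature and majority-vote points above can be illustrated with a minimal pure-Python sketch (toy logits and answers; this is not the library's sampling code):

```python
import math
import random
from collections import Counter

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index; temperature 0.0 degenerates to greedy argmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights)[0]

rng = random.Random(0)
logits = [2.0, 1.5, 0.5]
greedy = {sample_with_temperature(logits, 0.0, rng) for _ in range(20)}
varied = {sample_with_temperature(logits, 0.8, rng) for _ in range(50)}
# greedy always picks index 0; at T=0.8 several candidates appear, which is
# exactly what the search needs in order to have alternatives to choose from.

# The reward: sample several answers and take a majority vote.
votes = Counter(["18", "18", "20", "18", "21"])
majority_answer, count = votes.most_common(1)[0]
```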

Thank you @Ber666 for the answers. After playing with RAP a little, it totally makes sense.

May I make a suggestion about Figure 7? RAP(1) gets a score of 40 with 1 iteration, which is essentially few-shot with k=4 + CoT (does 1 iteration mean you ask 1 question, get a response, then ask for the final answer?). So only beyond that point (score 40) do you gain an advantage from RAP. In that regard, the chart should also show few-shot with k=0,1,2,3,4, so that the relative RAP gains are visible; otherwise it's misleading, suggesting that RAP can achieve great results with 1 iteration when all the gains come from few-shot prompting.

But regardless of that suggestion, I would like to acknowledge that the results are impressive. On a very small subset of GSM8K, with StableBeluga2-70B + CoT using LangChain + tools (WolframAlpha, Python, Wikipedia, search, etc.), I got around 50%, similar to the wizardlm-13b-v1.2.ggmlv3.q8_0 + RAP result posted yesterday.

Example output of StableBeluga2-70B + langchain + tools:

Question: Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy.  She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed.  In the afternoon, she gives her chickens another 25 cups of feed.  How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi's flock is 20 chickens?
Thought: I should calculate how much food each chicken gets per meal and then find out how much food they get in total for the first two meals. Then, I can subtract that from the total amount of feed given to all chickens (60 cups) to find out how much is left for the final meal.
Action: WolframAlpha
Action Input: 15 + 25 =
Observation: Assumption: 15 + 25 
Answer: 40
Thought: Now I know that Wendi gives her chickens a total of 40 cups of feed in the first two meals.
Action: WolframAlpha
Action Input: 3 * 20 =
Observation: Assumption: 3×20 
Answer: 60
Thought: Now I know that Wendi gives her chickens a total of 60 cups of feed per day.
Action: WolframAlpha
Action Input: 40 / 20 =
Observation: Assumption: 40/20 
Answer: 2
Thought: Now I know that each chicken gets 2 cups of food per meal.
Action: WolframAlpha
Action Input: 60 - 40 =
Observation: Assumption: 60 - 40 
Answer: 20
Thought:
Generated:
 Now I know that Wendi needs to give her chickens a total of 20 cups of feed in the final meal.
Final Answer: Wendi needs to give her chickens 20 cups of feed in the final meal of the day.
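The Thought/Action/Observation loop in the trace above can be sketched with a local calculator standing in for the WolframAlpha tool (a toy loop, not LangChain's actual agent code; all names here are illustrative):

```python
def calculator(expression: str) -> str:
    """Stand-in for the WolframAlpha tool: evaluate plain arithmetic only."""
    if not set(expression) <= set("0123456789+-*/. ()"):
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(eval(expression))

# Each step pairs a "thought" with a tool call, echoing the trace above.
steps = [
    ("Total feed in the first two meals?", "15 + 25"),
    ("Total feed per day for 20 chickens at 3 cups each?", "3 * 20"),
    ("Feed left for the final meal?", "60 - 40"),
]
for thought, action_input in steps:
    observation = calculator(action_input)
    print(f"Thought: {thought}")
    print(f"Action Input: {action_input}")
    print(f"Observation: {observation}")
```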

I'm thinking about mixing LangChain tools with RAP: instead of few-shot examples, provide the available tools. Llama 2 is much better with logic, and I suspect few-shot examples could matter less for it than for Llama 1. I'm testing RAP with k=0 and 1 now.
Your project can benefit from this, as it's a very high-profile project, like llama.cpp. If your method increases reasoning capabilities on small models (7B/13B), it will be widely used, since they can run on commodity hardware (24 GB VRAM). More users means more stars for your project and more paper citations, and other projects will start to optimize performance for your method (indeed, eval in llama.cpp is pretty slow; I suspect it's not done in a CUDA kernel but on CPU only).

Hi @hilarious-viking, thanks for your questions. I would like to clarify some points:

1 iteration means you asked 1 question, get response, than you ask for final answer

No. One iteration will lead to a chain without any branches (instead of a tree), but it still involves multiple steps of reasoning.

only from that point (score 40) you are getting advantage from RAP

Not exactly. It's fair to say that only from that point do we gain an advantage from Monte Carlo Tree Search, but the RAP framework also includes a world model formulation for numerical reasoning, i.e., asking subquestions and answering them in turn. That is beneficial even with only 1 iteration.
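One way to picture this: each iteration rolls out a full multi-step reasoning chain, and further iterations add sibling branches, turning the chain into a tree. A toy sketch (illustrative names only, not the library's API):

```python
import random

def rollout(depth, rng):
    """One iteration: a chain of `depth` reasoning steps, no branching."""
    return [f"subquestion-{i}:candidate-{rng.randint(0, 2)}" for i in range(depth)]

def search(n_iterations, depth, rng):
    paths = [rollout(depth, rng) for _ in range(n_iterations)]
    # Every distinct prefix across the paths is an edge of the search tree.
    edges = {tuple(p[:i + 1]) for p in paths for i in range(depth)}
    return paths, edges

rng = random.Random(0)
one_paths, one_edges = search(1, 4, rng)     # a single 4-step chain
many_paths, many_edges = search(10, 4, rng)  # chains share prefixes -> a tree
```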

May I ask which argument are you talking about when you say k=...?

I managed to get the Llama 2 model "running" via transformers. I needed to replace the attention masks or set them to None, replace model.forward() calls with model() calls, and make some changes to the logits processing, but this logits change broke successful prediction.

I'm completely impressed by the Llama 1 predictions. Basically, the rap_gsm8 example makes it possible to adapt to all kinds of (non-math) reasoning scenarios, with high-quality reasoning on a 7B model and only small code and prompt changes.

Hello @Ber666, thanks again for the clarifications.

No. One iteration will lead to a chain without any branches (instead of a tree), but it still involves multiple steps of reasoning.

Got it; indeed, 1 iteration is one graph evaluation. I suppose the graph dumped in the .pkl is the one taken as the solution, but in reality the graph is bigger. Is it possible to easily dump the full graph?
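Dumping the full tree should just be a matter of pickling the root node and walking its children. A hedged sketch, assuming nodes expose a `children` list (the actual class and attribute names in llm-reasoners may differ):

```python
import pickle

class Node:  # stand-in for the library's tree-node class
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def all_nodes(root):
    """Iterative depth-first walk collecting every node label in the tree."""
    stack, seen = [root], []
    while stack:
        node = stack.pop()
        seen.append(node.label)
        stack.extend(node.children)
    return seen

tree = Node("root", [Node("q1", [Node("q1.1")]), Node("q2")])
blob = pickle.dumps(tree)            # a dump of the *full* tree, not one path
labels = all_nodes(pickle.loads(blob))
```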

May I ask which argument are you talking about when you say k=...?

In few-shot prompting, k is the number of examples you show in the prompt before asking your question. In gsm8k you provide 4 examples (interactive_examples.json & useful_examples.json), and the 5th is the GSM8K question itself. That's why the base score of 40 is few-shot k=4 w/ RAP(1), and why it would be interesting to see results with k=0,1,2,3 too, to isolate the effectiveness of RAP from that of few-shot prompting. (It's known that few-shot prompting provides good results to begin with: "In this paper, we investigated the effectiveness of few-shot training for code summarization task and found that it can significantly outperform a fine-tuned model trained with thousands of samples with just ten samples"; see also https://arxiv.org/pdf/2212.10559.pdf and https://arxiv.org/pdf/2212.07677.pdf.)
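The k sweep amounts to truncating the worked-examples block of the prompt. A sketch (the builder and strings here are illustrative, not the repo's prompt code):

```python
def build_prompt(instruction, examples, question, k):
    """Prepend the first k worked examples to the real question."""
    shots = "\n\n".join(examples[:k])
    parts = [instruction] + ([shots] if k else []) + [question]
    return "\n\n".join(parts)

examples = [
    "Question 1: ...\nAnswer 1: ...",
    "Question 2: ...\nAnswer 2: ...",
]
zero_shot = build_prompt("Decompose the question.", examples, "Question: ...", k=0)
one_shot = build_prompt("Decompose the question.", examples, "Question: ...", k=1)
```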

basically the rap_gsm8 example allows to adapt to all kinds of reasoning scenarios (non-math) with high quality reasoning

@chris-aeviator I'm doing the same, but with Llama 2 13B; Llama 2 is surprisingly good at following instructions.

@chris-aeviator Hope this is not too late a response ✋. Can you check our experimental branch https://github.com/Ber666/llm-reasoners/tree/llama-2? We will merge it into the main branch soon. You may need to run pip install again to set up the package for Llama 2.

@Ber666 here is an update about my experiments with gsm8k. I ran:
RAP (1) w/o Few-shots (Zero-shot), score stabilises around: ~0.23
RAP (1) w/ Few-shots k=1 (One-shot): 0.40
RAP (1) w/ Few-shots k=4: ~0.41
RAP (10) w/ Few-shots k=4: ~0.47
Same model: wizardlm-13b-v1.2.ggmlv3.q8_0
Params:

  • temperature: 0.8 , top_k: 40 , top_p: 0.95, for RAP (1) w/ Zero-shot
  • temperature: 0.8 , top_k: 40 , top_p: 0.95, for RAP (1) w/ One-shot
  • temperature: 0.8 , top_k: 40 , top_p: 0.95, for RAP (1) w/ Few-shots k=4
  • defaults for RAP (10) w/ Few-shots k=4

Without few-shot examples (zero-shot), responses sometimes lack the requested "The answer is". Below, Case #184 could not be parsed because the answer was "The answer is that they will win the race by two seconds.", which is actually correct. There are a few empty results; I checked them, and most are not correct anyway (Case #184 is an exception), so the final score could be ~0.25, but that's still far from the 0.40 you have in the paper. I also ran RAP (1) w/ few-shot k=4 and got ~0.41, so without few-shot examples RAP would likely be lower than CoT + SC for gsm8k.
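The failure on Case #184 is consistent with a numeric-only extractor applied to the "The answer is" marker: a verbal answer yields an empty output. A sketch of such a parser (the regex is illustrative, not the repo's exact one):

```python
import re

def extract_answer(text: str) -> str:
    """Pull a numeric answer after "The answer is"; verbal answers yield ''."""
    m = re.search(r"The answer is\s*\$?(-?[\d,]*\.?\d+)", text)
    return m.group(1).replace(",", "") if m else ""

numeric = extract_answer("... The answer is 360.")
verbal = extract_answer(
    "The answer is that they will win the race by two seconds."
)  # no digits after the marker -> '' (like Case #184's output)
```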

RAP (1) w/o Few-shots (Zero-shot):

Case #180: correct=True, output='130', answer='130' ; accuracy=0.228 (41/180)
Case #181: correct=False, output='60', answer='20' ; accuracy=0.227 (41/181)
Case #182: correct=True, output='160', answer='160' ; accuracy=0.231 (42/182)
Case #183: correct=False, output='32', answer='23' ; accuracy=0.230 (42/183)
Case #184: correct=False, output='', answer='2' ; accuracy=0.228 (42/184)
Case #185: correct=False, output=None, answer='25' ; accuracy=0.227 (42/185)
Case #186: correct=True, output='30', answer='30' ; accuracy=0.231 (43/186)
Case #187: correct=False, output='6', answer='5' ; accuracy=0.230 (43/187)
Case #188: correct=True, output='106', answer='106' ; accuracy=0.234 (44/188)
Case #189: correct=False, output='20', answer='50' ; accuracy=0.233 (44/189)
Case #190: correct=False, output='29.50', answer='34' ; accuracy=0.232 (44/190)
Case #191: correct=False, output='36', answer='360' ; accuracy=0.230 (44/191)
Case #192: correct=False, output='35', answer='5' ; accuracy=0.229 (44/192)
Case #193: correct=False, output=None, answer='91' ; accuracy=0.228 (44/193)
Case #194: correct=False, output='18', answer='24' ; accuracy=0.227 (44/194)
Case #195: correct=False, output='0', answer='10' ; accuracy=0.226 (44/195)
Case #196: correct=True, output='12.00', answer='12' ; accuracy=0.230 (45/196)
Case #197: correct=False, output='3000', answer='120' ; accuracy=0.228 (45/197)

interactive_0_example.json

{
    "input": "Given a question, please decompose it into sub-questions. For each sub-question, please answer it in a complete sentence, ending with \"The answer is\". When the original question is answerable, please start the subquestion with \"Now we can answer the question: \".\n\n",
    "question_prefix": "Question 1: ",
    "subquestion_prefix": "Question 1.{}:",
    "overall_question_prefix": "Now we can answer the question:",
    "answer_prefix": "Answer 1.{}:",
    "index": 1
}

useful_0_example.json

{
  "input": "Given a question and some sub-questions, determine whether the last sub-question is useful to answer the question. Output 'Yes' or 'No', and a reason.\n\n",
  "question_prefix": "Question 1: ",
  "subquestion_prefix": "Question 1.{}:",
  "new_subquestion_prefix": "New question 1.{}:",
  "useful_prefix": "Is the new question useful?"
}
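Reading the two configs above: the `{}` placeholders are presumably filled with the running sub-question index when the prompt is assembled. A sketch of that expansion (illustrative; the actual assembly code in the repo may differ):

```python
# Keys copied from interactive_0_example.json above.
config = {
    "question_prefix": "Question 1: ",
    "subquestion_prefix": "Question 1.{}:",
    "answer_prefix": "Answer 1.{}:",
    "overall_question_prefix": "Now we can answer the question:",
}

def render_step(config, idx, subquestion, answer):
    """Fill the {} templates with the sub-question index idx."""
    return (config["subquestion_prefix"].format(idx) + " " + subquestion + "\n"
            + config["answer_prefix"].format(idx) + " " + answer)

step = render_step(config, 1, "How old is Mohamed?", "The answer is 60.")
```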

RAP (1) w/ Few-shots k=4:

Case #180: correct=False, output='160', answer='130' ; accuracy=0.422 (76/180)
Case #181: correct=True, output='20', answer='20' ; accuracy=0.425 (77/181)
Case #182: correct=False, output='5', answer='160' ; accuracy=0.423 (77/182)
Case #183: correct=False, output='24', answer='23' ; accuracy=0.421 (77/183)
Case #184: correct=False, output='8', answer='2' ; accuracy=0.418 (77/184)
Case #185: correct=False, output='14.7', answer='25' ; accuracy=0.416 (77/185)
Case #186: correct=True, output='30', answer='30' ; accuracy=0.419 (78/186)
Case #187: correct=False, output='3', answer='5' ; accuracy=0.417 (78/187)
Case #188: correct=False, output='102', answer='106' ; accuracy=0.415 (78/188)
Case #189: correct=False, output='40', answer='50' ; accuracy=0.413 (78/189)
Case #190: correct=False, output='16.75', answer='34' ; accuracy=0.411 (78/190)
Case #191: correct=True, output='360', answer='360' ; accuracy=0.414 (79/191)
Case #192: correct=True, output='5', answer='5' ; accuracy=0.417 (80/192)
Case #193: correct=False, output='406', answer='91' ; accuracy=0.415 (80/193)
Case #194: correct=False, output='12', answer='24' ; accuracy=0.412 (80/194)
Case #195: correct=False, output='0', answer='10' ; accuracy=0.410 (80/195)
Case #196: correct=True, output='12', answer='12' ; accuracy=0.413 (81/196)
Case #197: correct=False, output='0', answer='120' ; accuracy=0.411 (81/197)
...
Case #340: correct=False, output='120', answer='160' ; accuracy=0.421 (143/340)
Case #341: correct=False, output='2', answer='4' ; accuracy=0.419 (143/341)
Case #342: correct=False, output='160', answer='260' ; accuracy=0.418 (143/342)
Case #343: correct=True, output='87', answer='87' ; accuracy=0.420 (144/343)
Case #344: correct=False, output='30000', answer='180000' ; accuracy=0.419 (144/344)
Case #345: correct=True, output='2', answer='2' ; accuracy=0.420 (145/345)

RAP (1) w/ Few-shots k=1 (One-shot):

Case #340: correct=False, output='360', answer='160' ; accuracy=0.400 (136/340)
Case #341: correct=False, output='1', answer='4' ; accuracy=0.399 (136/341)
Case #342: correct=False, output='80', answer='260' ; accuracy=0.398 (136/342)
Case #343: correct=True, output='87', answer='87' ; accuracy=0.399 (137/343)
Case #344: correct=False, output='225000', answer='180000' ; accuracy=0.398 (137/344)
Case #345: correct=True, output='2', answer='2' ; accuracy=0.400 (138/345)
Case #346: correct=True, output='310', answer='310' ; accuracy=0.402 (139/346)
Case #347: correct=False, output='17', answer='9' ; accuracy=0.401 (139/347)
Case #348: correct=True, output='36', answer='36' ; accuracy=0.402 (140/348)
Case #349: correct=False, output='22', answer='10' ; accuracy=0.401 (140/349)
Case #350: correct=False, output='2400', answer='2640' ; accuracy=0.400 (140/350)

interactive_1_example.json

{
    "input": "Given a question, please decompose it into sub-questions. For each sub-question, please answer it in a complete sentence, ending with \"The answer is\". When the original question is answerable, please start the subquestion with \"Now we can answer the question: \".\n\nQuestion 1: Four years ago, Kody was only half as old as Mohamed. If Mohamed is currently twice as 30 years old, how old is Kody?\nQuestion 1.1: How old is Mohamed?\nAnswer 1.1: He is currently 30 * 2 = 60 years old. The answer is 60.\nQuestion 1.2: How old was Mohamed four years ago?\nAnswer 1.2: Four years ago, he must have been 60 - 4 = 56 years old. The answer is 56.\nQuestion 1.3: How old was Kody four years ago?\nAnswer 1.3: Kody was half as old as Mohamed four years ago. Thus, Kody was 56 / 2 = 28 years old. The answer is 28.\nQuestion 1.4: Now we can answer the question: How old is Kody?\nAnswer 1.4: She is currently 28 + 4 = 32 years old. The answer is 32.\n\n",
    "question_prefix": "Question 2: ",
    "subquestion_prefix": "Question 2.{}:",
    "overall_question_prefix": "Now we can answer the question:",
    "answer_prefix": "Answer 2.{}:",
    "index": 2
}

useful_1_example.json

{
  "input": "Given a question and some sub-questions, determine whether the last sub-question is useful to answer the question. Output 'Yes' or 'No', and a reason.\n\nQuestion 1: Four years ago, Kody was only half as old as Mohamed. If Mohamed is currently twice as 30 years old, how old is Kody?\nQuestion 1.1: How old is Mohamed?\nQuestion 1.2: How old was Mohamed four years ago?\nNew question 1.3: How old was Kody four years ago?\nIs the new question useful? Yes. We need the answer to calculate how old is Kody now.\n\n",
  "question_prefix": "Question 2: ",
  "subquestion_prefix": "Question 2.{}:",
  "new_subquestion_prefix": "New question 2.{}:",
  "useful_prefix": "Is the new question useful?"
}

@hilarious-viking Thanks for your kind help. We have now merged the Llama 2 integration into the main branch. Hope it will be helpful for you. If anyone has questions about experimenting with Llama 2, please contact me directly.

so without few-shot examples RAP would likely be lower than CoT + SC for gsm8k

I've been reproducing the work in the paper recently, and I have to say that RAP's comparison is reasonable, because the results in the paper for CoT and SC are also 4-shot.