KeyError: 'reference_masked_solution'
zdaiot opened this issue · comments
When I run:
python pipeline/run_labeling.py \
--model_path <path to trtllm model> \
--server_type tensorrt_llm \
--output_dir ./synthetic-solutions/gsm8k-masked/ \
--num_gpus 8 \
--num_runs 128 \
+prompt=code_base \
++prompt.few_shot_examples.examples_type=gsm8k_text_with_code \
++prompt.context_type=masked_solution \
++dataset=gsm8k-masked \
++split_name=train_full
The following error occurred:
Traceback (most recent call last):
File "/code/nemo_skills/inference/generate_solutions.py", line 118, in generate_solutions
prompts.append(get_prompt(cfg.prompt, input_dict=data_point))
File "/code/nemo_skills/inference/prompt/utils.py", line 74, in get_prompt
filled_examples.append(prompt_config.template.format(context=context.format(**example_dict), **example_dict))
KeyError: 'reference_masked_solution'
I am using version v0.1. The error seems to occur because the examples in the text_with_code dictionary in nemo_skills/inference/prompt/few_shot_examples/examples_gsm8k.py do not have a reference_masked_solution field. Can you add it?
Also, the gsm8k-masked and math-masked datasets you provided are supposed to have reference_masked_solution fields, but they seem to use the masked_reference_solution field instead.
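For readers following along, here is a minimal sketch of how this kind of KeyError arises and how a missing field could be detected up front. The template string and the example dictionary are hypothetical stand-ins, not the actual contents of examples_gsm8k.py, and missing_fields is an illustrative helper that does not exist in the repository.

```python
import string


def missing_fields(template: str, example: dict) -> list:
    """Return placeholder names used in `template` that `example` lacks."""
    wanted = {
        field_name
        for _, field_name, _, _ in string.Formatter().parse(template)
        if field_name
    }
    return sorted(wanted - example.keys())


# Hypothetical context template and few-shot example, for illustration only.
context = "Reference solution: {reference_masked_solution}"
example = {"question": "2 + 2 = ?", "masked_reference_solution": "M + M = N"}

failed_key = None
try:
    # str.format raises KeyError when a named placeholder has no value.
    context.format(**example)
except KeyError as err:
    failed_key = err.args[0]

print("format failed on:", failed_key)
print("missing fields:", missing_fields(context, example))
```

A check like this could run once over all few-shot examples before generation starts, so a field-name mismatch surfaces before any model call is made.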
Thanks @zdaiot for reporting this. Looks like we made multiple mistakes with this functionality when preparing the code for public release. Here is the PR that fixes the issue #27 and I'll also update v0.1 as soon as we merge that.
Sorry about this and please let us know if there are any other issues you're seeing!
Thank you for your reply. I have a few more questions.
- In the pipeline/launcher.py file, why is the local-sandbox-{uuid.uuid4()} image rebuilt every time instead of always reusing the same image?
- When executing python nemo_skills/evaluation/evaluate_results.py prediction_jsonl_files=/results/output-rs0.jsonl, I must reduce UWSGI_PROCESSES in the docker build command of sandbox.sh to docker build --tag ${SANDBOX_NAME} --build-arg="UWSGI_PROCESSES=$((`nproc --all` * 1))" --build-arg="UWSGI_CHEAPER=`nproc --all`" -f dockerfiles/Dockerfile.sandbox . Otherwise, I get the following error:
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.23.2</center>
</body>
</html>
- When generating data, the error start.sh: line: kill: %1: no such job is reported when start.sh executes kill %1.
  The solution is to add echo $! > /tmp/mpirun.pid after the mpirun command in the pipeline/run_launcher.py file, which writes the process ID to /tmp/mpirun.pid, and then change kill %1 in pipeline/run_labeling.py to kill $(cat /tmp/mpirun.pid).
- If hf_to_trtllm keeps running out of memory when converting the model, you can change device_map in nemo_skills/conversion/hf_to_trtllm.py to cpu. Specifying --load-model-on-cpu and --convert-model-on-cpu does not help. See NVIDIA/TensorRT-LLM#1440 (comment) and NVIDIA/TensorRT-LLM#1156.
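The PID-file workaround for kill %1 described above can be sketched in Python as well. This is only an illustration of the idea (record the process ID at launch, read it back when stopping) and not the project's launcher code; the sleeping child process and the PID file location are hypothetical.

```python
import os
import signal
import subprocess
import sys
import tempfile
from pathlib import Path


def start_with_pid_file(cmd, pid_file):
    """Launch cmd in the background and record its PID in pid_file."""
    proc = subprocess.Popen(cmd)
    Path(pid_file).write_text(str(proc.pid))
    return proc


def stop_from_pid_file(pid_file):
    """Terminate the process whose PID was recorded in pid_file."""
    pid = int(Path(pid_file).read_text().strip())
    os.kill(pid, signal.SIGTERM)
    return pid


# Hypothetical stand-in for the long-running server process.
pid_file = Path(tempfile.mkdtemp()) / "server.pid"
server = start_with_pid_file(
    [sys.executable, "-c", "import time; time.sleep(60)"], pid_file
)
stopped_pid = stop_from_pid_file(pid_file)
server.wait(timeout=10)  # reap the child so it does not linger as a zombie
```

Unlike a shell job reference such as %1, the recorded PID stays valid even when the stop command runs in a different shell from the one that started the process.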
When I execute run_sft_and_eval, saving the model keeps failing with an out-of-memory error.
But during training, the 8x80GB A100 GPUs only used about half of their memory.
python pipeline/run_sft_and_eval.py --expname openmath-mistral-7b --nemo_model /dockerdata/zhaodali/.cache/nemo/Mistral-7B-v0.1 --stages sft prepare_eval --num_nodes 1 --num_gpus 8 --disable_wandb ++model.data.train_ds.file_path=/data/sft-data.jsonl ++trainer.sft.max_epochs=2 ++trainer.sft.val_check_interval=1000 ++model.tensor_model_parallel_size=4 ++model.pipeline_model_parallel_size=2 ++model.optim.lr=1e-6
I use config:
global_batch_size: 256
micro_batch_size: 8
The error log is:
Training steps: 13%|█▎ | 1002/8000 [3:33:45<24:52:56, 12.80s/it, train_lr=1e-6, train_loss=0.299, train_consumed_samples=256512, train_step_time=22.4, train_epoch=1]
Error executing job with overrides: ['model.tensor_model_parallel_size=8', 'trainer.devices=8', 'trainer.num_nodes=1', 'model.restore_from_path=/nemo_model', 'model.data.validation_ds.file_path=/code/datasets/gsm8k/validation-sft.jsonl', 'exp_manager.create_wandb_logger=False', '+exp_manager.create_tensorboard_logger=True', 'exp_manager.name=openmath-mistral-7b', 'exp_manager.explicit_log_dir=/results', 'exp_manager.exp_dir=/results', '++exp_manager.max_time_per_run=10000:00:00:00', '++model.data.train_ds.file_path=/data/sft-data.jsonl', '++trainer.sft.max_epochs=2', '++trainer.sft.val_check_interval=1000', '++model.tensor_model_parallel_size=4', '++model.pipeline_model_parallel_size=2', '++model.optim.lr=1e-6']
Traceback (most recent call last):
File "/code/nemo_skills/finetuning/start_sft.py", line 260, in <module>
main()
File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
_run_hydra(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/code/nemo_skills/finetuning/start_sft.py", line 256, in main
sft_trainer.fit()
File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 186, in fit
loss, metrics = self.train_single_step(batch)
File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 129, in train_single_step
loss_mean, metrics = self.model.get_loss_and_metrics(batch=batch, forward_only=False)
File "/opt/NeMo-Aligner/nemo_aligner/models/nlp/gpt/gpt_sft_model.py", line 92, in get_loss_and_metrics
losses_reduced = fwd_bwd_function(
File "/opt/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1248, in forward_backward_pipelining_without_interleaving
output_tensor = forward_step(
File "/opt/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 192, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 1015, in fwd_output_and_loss_func
output_tensor = model(**forward_args)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/Megatron-LM/megatron/core/transformer/module.py", line 168, in forward
outputs = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 192, in forward
loss = self.compute_language_model_loss(labels, logits)
File "/opt/Megatron-LM/megatron/core/models/common/language_module/language_module.py", line 33, in compute_language_model_loss
loss = tensor_parallel.vocab_parallel_cross_entropy(logits.float(), labels)
File "/opt/Megatron-LM/megatron/core/tensor_parallel/cross_entropy.py", line 142, in vocab_parallel_cross_entropy
return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target, label_smoothing)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 551, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/Megatron-LM/megatron/core/tensor_parallel/cross_entropy.py", line 24, in forward
vocab_parallel_logits = vocab_parallel_logits - logits_max.unsqueeze(dim=-1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.03 GiB. GPU 6 has a total capacity of 79.33 GiB of which 1.56 GiB is free. Process 109591 has 77.74 GiB memory in use. Of the allocated memory 58.22 GiB is allocated by PyTorch, and 15.54 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The saved weights are very small and clearly incomplete. It also seems that training moves on to the next step before the model has finished saving. Can you help me? Thanks a lot.
Thanks for reporting the other issues, @zdaiot. For your first problem with masked solutions, everything should be fixed now and you can use the v0.1.1 branch, which has the new recommended commit for reproducing our results. Let me comment on the other questions you asked.
- In the pipeline/launcher.py file, why is the local-sandbox-{uuid.uuid4()} image rebuilt every time instead of always reusing the same image?
This is done to make it easier to make changes to the sandbox locally. Otherwise you'd have to push a new container every time you change the sandbox code. It should be easy to override that path in the launcher.py if you want to use a fixed container instead.
- When executing python nemo_skills/evaluation/evaluate_results.py prediction_jsonl_files=/results/output-rs0.jsonl, I must reduce UWSGI_PROCESSES in the docker build command of sandbox.sh to docker build --tag ${SANDBOX_NAME} --build-arg="UWSGI_PROCESSES=$((`nproc --all` * 1))" --build-arg="UWSGI_CHEAPER=`nproc --all`" -f dockerfiles/Dockerfile.sandbox . Otherwise, I get an error.
I cannot reproduce that issue, but it should be totally fine to make that change. Let me know if you see any other problems with sandbox because of that.
- When generating data, the error start.sh: line: kill: %1: no such job is reported when start.sh executes kill %1.
The solution is to add echo $! > /tmp/mpirun.pid after the mpirun command in the pipeline/run_launcher.py file, which writes the process ID to /tmp/mpirun.pid, and then change kill %1 in pipeline/run_labeling.py to kill $(cat /tmp/mpirun.pid).
Thanks, will see if we can integrate this change! Need to test that it works both locally for us and on our internal cluster infrastructure
- If hf_to_trtllm keeps running out of memory when converting the model, you can change device_map in nemo_skills/conversion/hf_to_trtllm.py to cpu. Specifying --load-model-on-cpu and --convert-model-on-cpu does not help. See NVIDIA/TensorRT-LLM#1440 (comment) and NVIDIA/TensorRT-LLM#1156.
Good to know, thanks for sharing this! We have not run out-of-memory in the conversion step in our experiments, but it's good to keep it as a reference for others facing the same issue
For the OOM during training, can you please try to change your config to have
micro_batch_size: 1
instead of 8? You can set validation interval to be very small, something like 10 and see if it works to have faster reproduction of the issue. Another potential thing is to use TP=8, PP=1, instead of TP=4, PP=2 as you currently have. You might even benefit (speed-wise) from using TP=4, PP=1, but it will use more memory, not less in that regime, so is not likely to help with the current problem. Please let us know if you're still facing OOM after these changes
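To see why lowering micro_batch_size can reduce memory without changing the effective batch, here is a small sketch of the batch-size arithmetic. It assumes the usual Megatron-style relationship, global_batch_size = micro_batch_size × data_parallel_size × accumulation steps with data_parallel_size = num_gpus / (TP × PP); the function name is hypothetical and is not part of NeMo.

```python
def batch_layout(global_batch_size, micro_batch_size, num_gpus, tp, pp):
    """Return (data_parallel_size, gradient_accumulation_steps)."""
    model_parallel = tp * pp
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by TP * PP")
    data_parallel = num_gpus // model_parallel
    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global batch must be divisible by micro batch * DP")
    return data_parallel, global_batch_size // samples_per_pass


# Configurations mentioned in this thread, on 8 GPUs with a global batch of 256.
print(batch_layout(256, 8, 8, tp=4, pp=2))  # original: micro batch 8
print(batch_layout(256, 1, 8, tp=4, pp=2))  # suggested: micro batch 1
print(batch_layout(256, 1, 8, tp=8, pp=1))  # suggested: TP=8, PP=1
print(batch_layout(256, 1, 8, tp=4, pp=1))  # TP=4, PP=1 gives DP=2
```

Under this assumption the optimizer still sees 256 samples per step in every case; a smaller micro batch only means fewer sequences are held in memory per forward and backward pass, at the cost of more accumulation steps.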
I am still hitting OOM. I am using the main branch:
python pipeline/run_sft_and_eval.py \
--expname openmath-mistral-7b \
--nemo_model /dockerdata/zhaodali/.cache/nemo/Mistral-7B-v0.1 \
--stages sft prepare_eval \
--num_nodes 1 \
--num_gpus 8 \
--disable_wandb \
++model.data.train_ds.file_path=/data/sft-data.jsonl \
++trainer.sft.max_epochs=1 \
++trainer.sft.val_check_interval=200 \
++model.tensor_model_parallel_size=8 \
++model.pipeline_model_parallel_size=1 \
++model.optim.lr=1e-6
The error log is attached as a file.
During training, the GPU memory usage is as follows:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A800-SXM4-80GB On | 00000000:0E:00.0 Off | 0 |
| N/A 52C P0 221W / 400W | 31903MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A800-SXM4-80GB On | 00000000:13:00.0 Off | 0 |
| N/A 49C P0 242W / 400W | 32209MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A800-SXM4-80GB On | 00000000:4B:00.0 Off | 0 |
| N/A 47C P0 235W / 400W | 32161MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A800-SXM4-80GB On | 00000000:51:00.0 Off | 0 |
| N/A 55C P0 244W / 400W | 31669MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A800-SXM4-80GB On | 00000000:93:00.0 Off | 0 |
| N/A 53C P0 236W / 400W | 32203MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A800-SXM4-80GB On | 00000000:99:00.0 Off | 0 |
| N/A 47C P0 234W / 400W | 31981MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A800-SXM4-80GB On | 00000000:CB:00.0 Off | 0 |
| N/A 51C P0 238W / 400W | 32189MiB / 81920MiB | 99% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A800-SXM4-80GB On | 00000000:D0:00.0 Off | 0 |
| N/A 49C P0 220W / 400W | 31853MiB / 81920MiB | 62% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
There seem to be two errors: out of memory, and Not a directory: '/nemo_model/model_config.yaml', because /nemo_model/ is a file. Do I need to run tar -xvf Mistral-7B-v0.1 after the checkpoint conversion, and then pass in the extracted folder path for SFT?
With v0.1.1 I still get the same out-of-memory issue.
train1.log
can you help me? Thanks a lot
Let me try to reproduce this today. It's always best to untar the .nemo checkpoint (it's just a tar archive) as it will save the time to load it. If you are able to use more than 1 node, that will certainly help, but I think mistral-7b should be trainable on 1 node, so will double check on my side
sorry, didn't get to try this today, hopefully will have some time tomorrow. Meanwhile, can you please try to set data.train_ds.max_seq_length=1024 and see if that helps?
- Thank you for your guidance. I can barely train by using data.train_ds.max_seq_length=1024.
  At the beginning, training takes up less than half of the GPU memory (31881MiB / 81920MiB).
  During the first validation, GPU memory increases slightly (37877MiB / 81920MiB).
  After the first checkpoint is saved and training resumes, GPU memory more than doubles (81093MiB / 81920MiB).
  After that I can barely train, and I will keep observing. Why does the GPU memory double? Is it a memory leak? When I run ps -axu | grep python, there are a lot of processes.
- I would also like to ask why you did not add the following CODE_SEPARATORS and CODE_OUTPUT_SEPARATORS as special tokens during training.
CODE_SEPARATORS = ('<llm-code>','</llm-code>') # used to execute code within these tags.
CODE_OUTPUT_SEPARATORS = ('<llm-code-output>','</llm-code-output>') # used to extract the code output
Currently, a tag such as <llm-code> is converted to several token IDs.
- I would also like to ask what the difference is between the main branch and v0.1.1. Can I use the main branch to reproduce the results?
- I would also like to ask why text_end.endswith(CODE_SEPARATORS[-1]) is needed here before calling the sandbox.
  According to the following training data, the "output" fields do not end with CODE_SEPARATORS[-1].
  Won't the data generated by the trained model then also not end with CODE_SEPARATORS[-1]?
{
"question": "Harry has 50 books in his library. His sister Flora has twice as many books and their cousin Gary has half the books Harry has. How many books do the three of them own together?",
"expected_answer": "175",
"predicted_answer": "175",
"error_message": "",
"is_correct": true,
"generation_type": "masked_reference_solution",
"dataset": "gsm8k",
"input": "System:\nYou're an expert Python programmer and mathematician. Help the user to solve this problem using code when necessary. Make sure to put the answer (and only answer) inside \\boxed{}.\n\nUser:\nHarry has 50 books in his library. His sister Flora has twice as many books and their cousin Gary has half the books Harry has. How many books do the three of them own together?\n\nAssistant:\n",
"output": "Let's solve this problem using Python code.\n<llm-code>\nlibrary_books = 50\nflora_books = library_books * 2\ngary_books = library_books / 2\ntotal_books = flora_books + gary_books + library_books\ntotal_books\n</llm-code>\n<llm-code-output>\n175.0\n</llm-code-output>\nThus Harry, Flora, and Gary have \\boxed{175} books in total."
}
- Why do you train Mistral-7B for 4 epochs but Llama2-70B for only 2 epochs?
  Can I train Mistral-7B for only 2 epochs?
Thanks a lot
Ok, if it's training with new max seq length, it means that you're just running OOM because it's a single node and probably the only way to fix it is to run on multiple nodes. But there should not be too many instances in our dataset that are > 1024, so hopefully this wouldn't affect the final results too much.
To answer your questions:
- It's most likely because you encounter longer sequences later in training and that's why you see a memory spike. The code might also take more memory during checkpoint saving, although I'm not sure about that. In general, PyTorch will not (by default) return any memory that it allocates (it will pre-allocate large chunks and hold on to them to optimize performance), so the memory usage reported through nvidia-smi will only go up, not down. It does not mean that all of that memory is actually in use at any given moment in training.
- It's certainly an option and will make inference more efficient. We just didn't want to change the tokenizer of the original models, so that's why we didn't do that.
- I'd recommend sticking to v0.1.1 for now as we are going to keep rapidly changing the main branch in the near future as we add more features and refactor our code. We will make a new v0.2 release when our code stabilizes and we test that everything works properly.
- CODE_SEPARATORS[-1] is equal to </llm-code> unless you changed something. So that's where we want to stop LLM generation and send the results to the sandbox to get the execution output.
- Mostly because we didn't have enough compute and time. In general training for 2 epochs is going to be very close to training for 4 epochs, so you should get something very close to our results if you keep it as 2.
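The stop-and-execute loop described in the CODE_SEPARATORS answer can be sketched as follows. This is a toy illustration, not the project's inference code: fake_generate and fake_sandbox are hypothetical stand-ins for the LLM server and the sandbox, and the scripted model replies are made up.

```python
import contextlib
import io

CODE_SEPARATORS = ("<llm-code>", "</llm-code>")
CODE_OUTPUT_SEPARATORS = ("<llm-code-output>", "</llm-code-output>")


def fake_generate(prompt, stop):
    """Hypothetical LLM call; the scripted replies already end at `stop`."""
    script = [
        "Let's use code.\n<llm-code>\nprint(50 + 100 + 25)\n</llm-code>",
        "Thus they own \\boxed{175} books.",
    ]
    turn = prompt.count(CODE_SEPARATORS[1])  # code blocks written so far
    return script[min(turn, len(script) - 1)]


def fake_sandbox(code):
    """Hypothetical sandbox: runs code in-process (unsafe outside a demo)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()


def solve(prompt, max_code_executions=3):
    text = ""
    for _ in range(max_code_executions + 1):
        text += fake_generate(prompt + text, stop=CODE_SEPARATORS[1])
        if not text.endswith(CODE_SEPARATORS[1]):
            break  # the model finished without asking for code execution
        # Extract the last code block and append its output for the model.
        code = text.rsplit(CODE_SEPARATORS[0], 1)[1][: -len(CODE_SEPARATORS[1])]
        output = fake_sandbox(code)
        text += f"\n{CODE_OUTPUT_SEPARATORS[0]}\n{output}{CODE_OUTPUT_SEPARATORS[1]}\n"
    return text


result = solve("How many books do the three of them own together?\n")
print(result)
```

In this sketch the training-style "output" field corresponds to the final assembled text, while each individual generation call is what ends with CODE_SEPARATORS[-1]; that is why the endswith check sits inside the loop.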
Thanks a lot.
+prompt_type=code_base \
++prompt.examples_type=gsm8k_text_with_code \
++prompt.num_few_shots=5
It seems that it should be
+prompt=code_base \
++prompt.examples_type=gsm8k_text_with_code \
++prompt.num_few_shots=5
- Why did you not want to change the tokenizer of the original models? What is the underlying reason?
- Mistral-7B trained for only 2 epochs does not perform very well, with only 67.7% accuracy on gsm8k. I will continue training for two more epochs to see if it improves.
- Does this project support LoRA training? Do I just need to change peft.peft_scheme to lora?
- Do you have any documentation on PyTorch memory allocation strategies? I would like to dig deeper into it.
- You're right, thanks for catching that!
- Tbh, we just didn't explore that, so I'm not sure if it's easy or hard to do. If the tokenizer already has some extra tokens that are not used, it should be very easy to just set our code tokens to those extra ones. But if it does not have any untrained tokens, you'd need to change the embedding dimension and that is more involved and might break some compatibility with other packages.
- Is that on the validation or test set? That does not look right - are you changing any other parameters/code or just running our provided commands directly, but on a single node?
- We never tried it, but it should work. Please let us know if you run into any errors.
- This is probably a good start https://pytorch.org/docs/stable/notes/cuda.html#memory-management and you can follow the links there to get more in-depth information.
I used the first and second steps here to prepare the data, and step 3 to convert the model.
I used the following command for training, with data.train_ds.max_seq_length=1024, on a single node with 8x80GB A100 GPUs.
python pipeline/run_sft_and_eval.py \
--expname openmath-mistral-7b \
--nemo_model /dockerdata/zhaodali/.cache/nemo_old/Mistral-7B-v0.1-untarred \
--stages sft prepare_eval \
--num_nodes 1 \
--num_gpus 8 \
--disable_wandb \
++model.data.train_ds.file_path=/data/sft-data.jsonl \
++trainer.sft.max_epochs=4 \
++trainer.sft.val_check_interval=600 \
++model.tensor_model_parallel_size=8 \
++model.pipeline_model_parallel_size=1 \
++model.optim.lr=1e-6
The final weight file is as follows:
'megatron_gpt_sft--val_code_generation_accuracy=0.929-step=3000-consumed_samples=384000-epoch=1'
'megatron_gpt_sft--val_code_generation_accuracy=0.940-step=2400-consumed_samples=307200-epoch=1'
'megatron_gpt_sft--val_code_generation_accuracy=0.943-step=3600-consumed_samples=460800-epoch=2'
'megatron_gpt_sft--val_code_generation_accuracy=0.949-step=4800-consumed_samples=614400-epoch=3'
'megatron_gpt_sft--val_code_generation_accuracy=0.950-step=4200-consumed_samples=537600-epoch=2'
'megatron_gpt_sft--val_code_generation_accuracy=0.956-step=5400-consumed_samples=691200-epoch=3'
'megatron_gpt_sft--val_code_generation_accuracy=0.958-step=6344-consumed_samples=812032-epoch=4'
'megatron_gpt_sft--val_code_generation_accuracy=0.958-step=6344-consumed_samples=812032-epoch=4-last'
'megatron_gpt_sft--val_code_generation_accuracy=0.959-step=6000-consumed_samples=768000-epoch=3'
The command used in the final test is
python pipeline/run_eval.py \
--model_path /dockerdata/zhaodali/.cache/nemo_old/nemo_skills_results/nemo-skills-exps/checkpoints/openmath-mistral-7b/openmath-mistral-7b.nemo \
--server_type nemo \
--output_dir `pwd`/openmath-mistral-7b-eval-results \
--benchmarks gsm8k:0 \
--num_gpus 8 \
--num_nodes 1 \
+prompt=code_sfted \
++prompt.num_few_shots=0 \
++split_name=test \
++server.max_code_executions=6 \
++server.stop_on_code_error=False \
++batch_size=64
python pipeline/summarize_results.py `pwd`/openmath-mistral-7b-eval-results
After training for 4 epochs, the results are as follows (I also tried averaging the weight files from different steps, and the results were not much different):
benchmark,decoding,num_entries,correct_answer,wrong_answer,no_answer
gsm8k,greedy,1319,70.20,28.96,0.83
I also tested the weights you released (OpenMath-Mistral-7B); the result is:
benchmark,decoding,num_entries,correct_answer,wrong_answer,no_answer
gsm8k,greedy,1319,79.91,19.26,0.83
Is this because of changing data.train_ds.max_seq_length to 1024?
But only a few samples seem to be longer than 1024. Is there a parameter that directly discards training samples longer than 1024?
Thanks a lot
Somehow your epoch count is not matching the size of the dataset. Can you please check how many elements are in the sft-data.jsonl? It should be 1024000 and then the global batch size is 128, so 1024000 / 128 = 8000. So it should be 8000 steps per epoch and you're only getting half of that if not less
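The mismatch can be checked directly from the checkpoint names posted above. The sketch below only redoes the arithmetic; it assumes that epoch=4 in the last checkpoint name means four full epochs were completed at step 6344, which the thread does not state explicitly.

```python
def steps_per_epoch(num_samples, global_batch_size):
    """Optimizer steps needed to see every sample once."""
    return num_samples // global_batch_size


# Expected values for the full dataset.
expected_steps = steps_per_epoch(1_024_000, 128)

# Values read off the last checkpoint name:
# step=6344, consumed_samples=812032, epoch=4
observed_batch_size = 812_032 // 6344
observed_steps = 6344 // 4  # assumption: 4 complete epochs at step 6344
observed_samples = observed_steps * observed_batch_size

print("expected steps per epoch:", expected_steps)
print("observed global batch size:", observed_batch_size)
print("observed steps per epoch:", observed_steps)
print("implied dataset size:", observed_samples)
```

Under that assumption the run saw roughly 203K samples per epoch, about a fifth of the expected 1024000, which points at the data file rather than at max_seq_length.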
By the way, these parameters
++server.max_code_executions=6 \
++server.stop_on_code_error=False \
are only supported for trtllm model, for nemo model (in v0.1.1) they are simply ignored. But they only make a difference of 1-2%, not more, so certainly there is some bigger issue here
Let me also try to run the same commands as you're doing on my side to see if I can reproduce this regression. I'd be really surprised if it's caused by the max_seq_length, so want to understand what's going on here
another question is why do you set val_check_interval=600? This causes you to save too many checkpoints and we only keep 8, so they will not be equally spaced. We found that it's generally best to save at equal intervals 4-8 checkpoints and then average all of them
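As a rough illustration of the equal-spacing advice, the sketch below picks a validation interval that yields a chosen number of checkpoints. The helper names are hypothetical, and it assumes one checkpoint is written at every validation step; the 6344 total steps are taken from the run posted above.

```python
def equally_spaced_interval(total_steps, num_checkpoints):
    """Largest interval that yields at least num_checkpoints checkpoints."""
    return max(1, total_steps // num_checkpoints)


def checkpoint_steps(total_steps, interval):
    """Steps at which a checkpoint would be written for a given interval."""
    return list(range(interval, total_steps + 1, interval))


total_steps = 6344  # final step of the run in this thread

too_many = checkpoint_steps(total_steps, 600)
print(len(too_many), "checkpoints with interval 600")

interval = equally_spaced_interval(total_steps, 8)
spaced = checkpoint_steps(total_steps, interval)
print("interval", interval, "gives checkpoints at", spaced)
```

With an interval of 600 more checkpoints are produced than the 8 that are kept, so the survivors end up unevenly spaced; choosing the interval from the total step count avoids that.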
Thank you very much. My open-math-instruct-1/sft-data.jsonl file was corrupted (which is why I used val_check_interval=600), so I regenerated it. But it will take about 10 days to complete the training, so I may experiment on a small dataset first.
- I would like to ask: during training, model.data.validation_ds.file_path is datasets/gsm8k/train_sft.jsonl, but OpenMathInstruct-1 should be generated using datasets/gsm8k/train_full.jsonl, so the SFT training set seems to contain the validation_ds set.
- Can you release the scripts that go from synthetic-solutions to open-math-instruct-1?
  Once I have generated the data, do I just need to pass all the jsonl file paths in the synthetic-solutions folder to the preprocessed_dataset_files parameter in the following command?
python nemo_skills/finetuning/prepare_sft_data.py \
++preprocessed_dataset_files="xxxx" \
++output_path=$NEMO_SKILLS_DATA/sft-data.jsonl \
++downsampling_method=fair \
++num_output_samples=1024000 \
++text_filter_type=any_code \
++trim_solutions=True
I would like to ask: during training, model.data.validation_ds.file_path is datasets/gsm8k/train_sft.jsonl, but OpenMathInstruct-1 should be generated using datasets/gsm8k/train_full.jsonl, so the SFT training set seems to contain the validation_ds set.
That's right, for the final model training round we combine both "train" and "validation" subsets to be able to compare to prior works (as gsm8k does not have a standard validation split and other people use full training data to run their experiments). But if you're going to do research and run hyper-parameter searches (and only need to compare to your own baseline), we'd recommend you only use the "train" subset and evaluate on the "validation" subset.
If you're running on the train_full subset and want to make things a bit faster you can also set trainer.sft.limit_val_batches=1 (integer value) to only run validation on 1 batch as those numbers are not meaningful anyway. And also set validation batch size to be 8 (should be at least number of GPUs I think) and potentially decrease maximum sequence length inside model.inference.length_params.max_length. That will ensure that your validation does not slow down training anymore (I don't think you can fully disable it, so just need to make sure it's very fast).
Can you release the scripts that go from synthetic-solutions to open-math-instruct-1?
Once I have generated the data, do I just need to pass all the jsonl file paths in the synthetic-solutions folder to the preprocessed_dataset_files parameter in the following command?
I don't fully understand the question. But if you generate new data, you should use the prediction_jsonl_files parameter instead of preprocessed_dataset_files. You can then pass in generated output-rs*.jsonl to that script and it will prepare the data for sft. I guess we missed that part in the docs - will add it somewhere.
If you want to use a subset you can potentially only limit yourself to gsm8k, so manually prepare a subset with "dataset": "gsm8k" and set num_output_samples to 256000 or even 128000. For 256K I think you should be getting within 1-2% of our best results for gsm8k. For 128K the regression might be larger, but shouldn't be more than 5% or so. You can also try to set tensor_parallel_size to be 4 or even 2 to make the job run faster, but I'm not fully sure about the effect of that.
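Preparing such a gsm8k-only subset by hand could look like the sketch below. It assumes each line of the JSONL file is a JSON object with a "dataset" field, as in the training sample shown earlier in this thread; the function name, file names and demo rows are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def filter_by_dataset(src, dst, dataset="gsm8k"):
    """Copy JSONL rows whose "dataset" field equals `dataset`; return count."""
    kept = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            if json.loads(line).get("dataset") == dataset:
                fout.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept


# Tiny demo file standing in for the real SFT data.
rows = [
    {"dataset": "gsm8k", "question": "q1"},
    {"dataset": "math", "question": "q2"},
    {"dataset": "gsm8k", "question": "q3"},
]
work_dir = Path(tempfile.mkdtemp())
src = work_dir / "sft-data.jsonl"
dst = work_dir / "sft-data-gsm8k.jsonl"
src.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")

kept = filter_by_dataset(src, dst)
print(kept, "rows kept")
```

Reading line by line keeps memory use flat, which matters for a file with around a million rows.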
Special thanks to you. I found that by using only the 61K samples with "dataset": "gsm8k", the accuracy of Mistral-7B can reach 74.53%.
Also, have you ever encountered Error: no such file /tmp/server_logs.txt when executing the command below? It happens because tail -n0 -f /tmp/server_logs.txt | sed '/Running on all addresses/ q' is executed immediately after {server_start_cmd} starts running.
export PYTHONPATH=$PYTHONPATH:/code && \
{server_start_cmd} && \
if [ $SLURM_LOCALID -eq 0 ]; then \
echo "Waiting for the server to start" && \
tail -n0 -f /tmp/server_logs.txt | sed '/Running on all addresses/ q' && \
My solution is:
export PYTHONPATH=$PYTHONPATH:/code && \
{server_start_cmd} && sleep 10 && \
if [ $SLURM_LOCALID -eq 0 ]; then \
echo "Waiting for the server to start" && \
tail -f /tmp/server_logs.txt | sed '/Running on all addresses/ q' && \
Great to hear you're getting some good results! I also double checked that max_seq_length=1024 shouldn't affect the final results significantly. After using it + running for 1 epoch on full dataset, I was able to get the following (using nemo eval):
Running compute_metrics.py for math
2024-05-27 10:37:57 INFO Greedy results
2024-05-27 10:37:58 INFO Evaluation results for ['../experiments/nemo-skills-exps/results/openmath-mistral-7b-repro/math/output-greedy.jsonl']
2024-05-27 10:37:58 INFO Total eval entries: 5000
2024-05-27 10:37:58 INFO Correct answer: 41.16%
2024-05-27 10:37:58 INFO Wrong answer: 41.14%
2024-05-27 10:37:58 INFO No answer: 17.70%
2024-05-27 10:37:58 INFO Running compute_metrics.py for gsm8k
2024-05-27 10:37:58 INFO Greedy results
2024-05-27 10:37:58 INFO Evaluation results for ['../experiments/nemo-skills-exps/results/openmath-mistral-7b-repro/gsm8k/output-greedy.jsonl']
2024-05-27 10:37:58 INFO Total eval entries: 1319
2024-05-27 10:37:58 INFO Correct answer: 79.08%
2024-05-27 10:37:58 INFO Wrong answer: 19.64%
2024-05-27 10:37:58 INFO No answer: 1.29%
2024-05-27 10:37:58 INFO benchmark,decoding,num_entries,correct_answer,wrong_answer,no_answer
2024-05-27 10:37:58 INFO math,greedy,5000,41.16,41.14,17.70
2024-05-27 10:37:58 INFO gsm8k,greedy,1319,79.08,19.64,1.29
By the way, if you can identify a "smart" small subset of our dataset that gives similar results to using full data, that will be a great research contribution. We think that by smartly selecting a subset of solutions (not just randomly), it should be possible to significantly reduce the data size without any impact on the final results, but we don't have any experiments to demonstrate that yet.
We never saw the error you're mentioning, but if sleeping for a few seconds works, that's great. It probably means that your filesystem is somehow slow, since the output file should be created immediately after the server command is launched. Let me add a sleep for a few seconds to our code, so that other people don't face the same issue.
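As an alternative to a fixed sleep, the race between the server creating its log file and the watcher opening it can be avoided by polling for the file first. The following is only a sketch of that idea in Python, not what the launcher does; the helper name wait_for_line and its parameters are hypothetical.

```python
import time
from pathlib import Path


def wait_for_line(log_path: Path, marker: str, timeout: float = 600.0, poll: float = 0.5) -> bool:
    """Return True once `marker` appears in the log, False on timeout."""
    deadline = time.monotonic() + timeout
    # Step 1: wait until the server has created the log file.
    while not log_path.exists():
        if time.monotonic() > deadline:
            return False
        time.sleep(poll)
    # Step 2: read from the start and follow the file until the marker shows up.
    with log_path.open(encoding="utf-8", errors="replace") as log:
        pending = ""  # holds a line that has only been partially written so far
        while time.monotonic() <= deadline:
            chunk = log.readline()
            if not chunk:
                time.sleep(poll)  # nothing new yet
                continue
            pending += chunk
            if marker in pending:
                return True
            if pending.endswith("\n"):
                pending = ""  # complete line without the marker; discard it
    return False
```

Unlike tail -n0 -f, this reads the file from the beginning, so a marker written before the watcher started is still seen. For the case in this thread the call would look like wait_for_line(Path("/tmp/server_logs.txt"), "Running on all addresses").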
Let me close this issue as I think all of the questions are resolved. Please feel free to open a new issue/discussion if any more problems arise.
Thanks a lot @Kipok
Hello, I have a few more questions to ask.
- During SFT training: is inference.strategy only used in the validation phase? That is to say, during the training phase, the model will predict llm-code-output and also calculate the loss. But in the validation phase, llm-code-output is generated with the help of the sandbox.
- When validating on the test set, the following logs are generated. Which function prints them?
[pid: 12|app: 0|req: 94/115] 172.17.0.5 () {40 vars in 524 bytes} [Mon Jun 3 07:40:41 2024] PUT /execute_code => generated 302 bytes in 36 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 0)
172.17.0.5 - - [03/Jun/2024:07:40:41 +0000] "PUT /execute_code HTTP/1.1" 200 302 "-" "python-requests/2.31.0" "-"
- After SFT, how to test the performance of the model without the help of the sandbox?
- When validating on the test set, are the results of nemo and tensorrt-llm the same? I see that https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/docs/evaluation.md uses nemo, but https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/docs/reproducing-results.md uses tensorrt-llm.
- Why is kwargs['handle_code_execution'] = False set when using nemo? https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/nemo_skills/inference/server/model.py#L262
Thanks again
- That's right, the model will be trained on the code outputs in the same way as on any other tokens. We didn't really check if it's helpful or harmful. It is possible to change this, but we don't directly support it in our code. During testing/validation the code output is always added via sandbox.
- This is printed from the sandbox server (default flask logs). If you don't want to see sandbox logs, you can disable them with something like https://stackoverflow.com/questions/14888799/disable-console-messages-in-flask-server at this point in the code: https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/nemo_skills/code_execution/local_sandbox/local_sandbox_server.py#L28
- I'm not sure why you would want to do that. If you run it without the sandbox, it will still generate llm-code tokens, but instead of seeing the ground-truth output, it will just hallucinate the llm-code-output tokens and everything in between them. So it will most likely perform quite poorly. If you still want to do that, remove the inference strategy from here https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/nemo_skills/inference/server/serve_nemo.py#L137 and from line 144 there as well. Or set handle_code_execution to False for trtllm if that's what you're using for eval.
- In the eval docs nemo is just used as an example, you can change it to trtllm provided you convert the checkpoint first. The results will not be the same, but should be within 1-2% difference from each other. Trtllm is just way faster and also supports continuing execution after code errors, so that's why we used it for our final evaluations.
- That's because our code execution implementation for nemo is done directly inside the generation code. This means that we don't need to make a second call after getting sandbox execution results and can re-use the kv-cache of the existing generation. Basically, you can set it to True if you want and you would get the same results, just much slower.
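For the second answer above, the approach in the linked Stack Overflow question boils down to raising the level of the werkzeug logger, which is where Flask's built-in development server emits its per-request lines. A minimal sketch, assuming the sandbox is served that way; the helper name quiet_request_logs is hypothetical. Lines in the [pid: ...|app: ...] format have the shape of uWSGI request logs, and if the sandbox runs behind uWSGI those would need to be turned off through uWSGI's own logging options instead.

```python
import logging


def quiet_request_logs(level: int = logging.ERROR) -> logging.Logger:
    """Raise the threshold of the 'werkzeug' logger so routine request lines are dropped."""
    log = logging.getLogger("werkzeug")
    log.setLevel(level)
    return log
```

Calling quiet_request_logs() once before the Flask app starts serving is enough, since the logger is looked up by name and shared across the process.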