KeyError: 'reference_masked_solution'

Question

zdaiot opened this issue 5 months ago · comments

When I run:

python pipeline/run_labeling.py \
  --model_path <path to trtllm model> \
  --server_type tensorrt_llm \
  --output_dir ./synthetic-solutions/gsm8k-masked/ \
  --num_gpus 8 \
  --num_runs 128 \
  +prompt=code_base \
  ++prompt.few_shot_examples.examples_type=gsm8k_text_with_code \
  ++prompt.context_type=masked_solution \
  ++dataset=gsm8k-masked \
  ++split_name=train_full

The following error occurred:

Traceback (most recent call last):
  File "/code/nemo_skills/inference/generate_solutions.py", line 118, in generate_solutions
    prompts.append(get_prompt(cfg.prompt, input_dict=data_point))
  File "/code/nemo_skills/inference/prompt/utils.py", line 74, in get_prompt
    filled_examples.append(prompt_config.template.format(context=context.format(**example_dict), **example_dict))
KeyError: 'reference_masked_solution'

I am using version v0.1. This error seems to occur because the examples in the text_with_code dictionary in nemo_skills/inference/prompt/few_shot_examples/examples_gsm8k.py do not have a reference_masked_solution field. Can you add it?

Also, the gsm8k-masked and math-masked datasets you provided are supposed to contain a reference_masked_solution field, but they seem to use masked_reference_solution instead.
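
The failure mode in the traceback can be reproduced in isolation. The sketch below is not the nemo_skills code: the template string, the example dictionary, and the missing_fields helper are hypothetical stand-ins that only borrow the two field names from this issue. It shows that str.format raises KeyError as soon as a placeholder has no matching key, and how the mismatch can be detected before formatting.

```python
from string import Formatter

# Hypothetical stand-ins for the prompt context template and one few-shot example.
context_template = "Masked reference solution:\n{reference_masked_solution}\n"
example_dict = {
    "question": "What is 2 + 3?",
    # The key below uses the other word order, so the template cannot find its field.
    "masked_reference_solution": "Adding M and N gives the answer.",
}


def missing_fields(template, fields):
    """Return placeholder names used by `template` that are absent from `fields`."""
    names = [name for _, name, _, _ in Formatter().parse(template) if name]
    return [name for name in names if name not in fields]


try:
    context_template.format(**example_dict)
    missing_key = None
except KeyError as err:
    # str.format reports the first placeholder it could not resolve.
    missing_key = err.args[0]

print(missing_key)
print(missing_fields(context_template, example_dict))
```

Running a check like missing_fields over every few-shot example and dataset row before generation would surface a renamed field up front rather than in the middle of a run.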

Igor Gitman · Answer 1 · Tue May 21 2024 04:20:12 GMT+0800 (China Standard Time)

Thanks @zdaiot for reporting this. Looks like we made multiple mistakes with this functionality when preparing the code for public release. Here is the PR that fixes the issue #27 and I'll also update v0.1 as soon as we merge that.

Sorry about this and please let us know if there are any other issues you're seeing!

zdaiot · Answer 2 · Tue May 21 2024 14:21:31 GMT+0800 (China Standard Time)

Thank you for your reply. I have a few more questions.

In the pipeline/launcher.py file, why rebuild the local-sandbox-{uuid.uuid4()} image every time instead of always using the same image?
When executing python nemo_skills/evaluation/evaluate_results.py prediction_jsonl_files=/results/output-rs0.jsonl, I must reduce UWSGI_PROCESSES in the docker build command in build-sandbox.sh to docker build --tag ${SANDBOX_NAME} --build-arg="UWSGI_PROCESSES=$((`nproc --all` * 1))" --build-arg="UWSGI_CHEAPER=`nproc --all`" -f dockerfiles/Dockerfile.sandbox . Otherwise, I get the following error:

<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.23.2</center>
</body>
</html>

When generating data, the error "start.sh: line: kill: %1: no such job" is reported when start.sh runs kill %1.
The solution is to add echo $! > /tmp/mpirun.pid after the mpirun command in the pipeline/run_launcher.py file, which writes the process ID to /tmp/mpirun.pid, and then change kill %1 in pipeline/run_labeling.py to kill $(cat /tmp/mpirun.pid).
If hf_to_trtllm keeps running out of memory while converting the model, you can change device_map in nemo_skills/conversion/hf_to_trtllm.py to cpu. Specifying --load-model-on-cpu and --convert-model-on-cpu does not help. See NVIDIA/TensorRT-LLM#1440 (comment) and NVIDIA/TensorRT-LLM#1156.

zdaiot · Answer 3 · Tue May 21 2024 22:51:49 GMT+0800 (China Standard Time)

When I run run_sft_and_eval, I keep getting an out-of-memory error when the model is saved.
But during training, the 8x80GB A100 GPUs only use about half of their memory.

python pipeline/run_sft_and_eval.py \
   --expname openmath-mistral-7b \
   --nemo_model /dockerdata/zhaodali/.cache/nemo/Mistral-7B-v0.1 \
   --stages sft prepare_eval \
   --num_nodes 1 \
   --num_gpus 8 \
   --disable_wandb \
   ++model.data.train_ds.file_path=/data/sft-data.jsonl \
   ++trainer.sft.max_epochs=2 \
   ++trainer.sft.val_check_interval=1000 \
   ++model.tensor_model_parallel_size=4 \
   ++model.pipeline_model_parallel_size=2 \
   ++model.optim.lr=1e-6

I use config:

      global_batch_size: 256
      micro_batch_size: 8

The error log is:

Training steps:  13%|█▎        | 1002/8000 [3:33:45<24:52:56, 12.80s/it, train_lr=1e-6, train_loss=0.299, train_consumed_samples=256512, train_step_time=22.4, train_epoch=1]  
Error executing job with overrides: ['model.tensor_model_parallel_size=8', 'trainer.devices=8', 'trainer.num_nodes=1', 'model.restore_from_path=/nemo_model', 'model.data.validation_ds.file_path=/code/datasets/gsm8k/validation-sft.jsonl', 'exp_manager.create_wandb_logger=False', '+exp_manager.create_tensorboard_logger=True', 'exp_manager.name=openmath-mistral-7b', 'exp_manager.explicit_log_dir=/results', 'exp_manager.exp_dir=/results', '++exp_manager.max_time_per_run=10000:00:00:00', '++model.data.train_ds.file_path=/data/sft-data.jsonl', '++trainer.sft.max_epochs=2', '++trainer.sft.val_check_interval=1000', '++model.tensor_model_parallel_size=4', '++model.pipeline_model_parallel_size=2', '++model.optim.lr=1e-6']
Traceback (most recent call last):
  File "/code/nemo_skills/finetuning/start_sft.py", line 260, in <module>
    main()
  File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/code/nemo_skills/finetuning/start_sft.py", line 256, in main
    sft_trainer.fit()
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 186, in fit
    loss, metrics = self.train_single_step(batch)
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 129, in train_single_step
    loss_mean, metrics = self.model.get_loss_and_metrics(batch=batch, forward_only=False)
  File "/opt/NeMo-Aligner/nemo_aligner/models/nlp/gpt/gpt_sft_model.py", line 92, in get_loss_and_metrics
    losses_reduced = fwd_bwd_function(
  File "/opt/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1248, in forward_backward_pipelining_without_interleaving
    output_tensor = forward_step(
  File "/opt/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 192, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 1015, in fwd_output_and_loss_func
    output_tensor = model(**forward_args)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/Megatron-LM/megatron/core/transformer/module.py", line 168, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 192, in forward
    loss = self.compute_language_model_loss(labels, logits)
  File "/opt/Megatron-LM/megatron/core/models/common/language_module/language_module.py", line 33, in compute_language_model_loss
    loss = tensor_parallel.vocab_parallel_cross_entropy(logits.float(), labels)
  File "/opt/Megatron-LM/megatron/core/tensor_parallel/cross_entropy.py", line 142, in vocab_parallel_cross_entropy
    return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target, label_smoothing)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 551, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/Megatron-LM/megatron/core/tensor_parallel/cross_entropy.py", line 24, in forward
    vocab_parallel_logits = vocab_parallel_logits - logits_max.unsqueeze(dim=-1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.03 GiB. GPU 6 has a total capacity of 79.33 GiB of which 1.56 GiB is free. Process 109591 has 77.74 GiB memory in use. Of the allocated memory 58.22 GiB is allocated by PyTorch, and 15.54 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The saved weights are very small and clearly incomplete. It also seems that training moves on to the next step before the model has finished saving. Can you help me? Thanks a lot.

Igor Gitman · Answer 4 · Wed May 22 2024 07:24:46 GMT+0800 (China Standard Time)

Thanks for reporting the other issues, @zdaiot. For your first problem with masked solutions, everything should be fixed now and you can use the v0.1.1 branch, which has the new recommended commit for reproducing our results. Let me comment on the other questions you asked.

In the pipeline/launcher.py file, why rebuild the local-sandbox-{uuid.uuid4()} image every time instead of always using the same image?

This is done to make it easier to make changes to the sandbox locally. Otherwise you'd have to push a new container every time you change the sandbox code. It should be easy to override that path in the launcher.py if you want to use a fixed container instead.

When executing python nemo_skills/evaluation/evaluate_results.py prediction_jsonl_files=/results/output-rs0.jsonl, I must reduce UWSGI_PROCESSES in the docker build command in build-sandbox.sh to docker build --tag ${SANDBOX_NAME} --build-arg="UWSGI_PROCESSES=$((`nproc --all` * 1))" --build-arg="UWSGI_CHEAPER=`nproc --all`" -f dockerfiles/Dockerfile.sandbox . Otherwise, I get an error.

I cannot reproduce that issue, but it should be totally fine to make that change. Let me know if you see any other problems with sandbox because of that.

When generating data, the error "start.sh: line: kill: %1: no such job" is reported when start.sh runs kill %1.
The solution is to add echo $! > /tmp/mpirun.pid after the mpirun command in the pipeline/run_launcher.py file, which writes the process ID to /tmp/mpirun.pid, and then change kill %1 in pipeline/run_labeling.py to kill $(cat /tmp/mpirun.pid).

Thanks, will see if we can integrate this change! Need to test that it works both locally for us and on our internal cluster infrastructure

If hf_to_trtllm keeps running out of memory while converting the model, you can change device_map in nemo_skills/conversion/hf_to_trtllm.py to cpu. Specifying --load-model-on-cpu and --convert-model-on-cpu does not help. See NVIDIA/TensorRT-LLM#1440 (comment) and NVIDIA/TensorRT-LLM#1156.

Good to know, thanks for sharing this! We have not run out of memory in the conversion step in our experiments, but it's good to keep it as a reference for others facing the same issue.

Igor Gitman · Answer 5 · Wed May 22 2024 07:29:46 GMT+0800 (China Standard Time)

For the OOM during training, can you please try to change your config to have

micro_batch_size: 1

instead of 8? You can set validation interval to be very small, something like 10 and see if it works to have faster reproduction of the issue. Another potential thing is to use TP=8, PP=1, instead of TP=4, PP=2 as you currently have. You might even benefit (speed-wise) from using TP=4, PP=1, but it will use more memory, not less in that regime, so is not likely to help with the current problem. Please let us know if you're still facing OOM after these changes
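
To make the trade-offs in these suggestions concrete, here is a rough sketch of the usual Megatron-style bookkeeping: data-parallel size is the GPU count divided by TP × PP, and the global batch is reached by accumulating micro-batches. The parallel_layout helper is hypothetical, and the global batch size of 256 is taken from the config reported above. It only does arithmetic and does not measure memory.

```python
def parallel_layout(num_gpus, tp, pp, global_batch_size, micro_batch_size):
    """Derive data-parallel size and gradient-accumulation steps for a layout."""
    model_parallel = tp * pp  # GPUs that together hold one copy of the model
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by tp * pp")
    data_parallel = num_gpus // model_parallel
    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global batch must be divisible by micro_batch * data_parallel")
    return {
        "data_parallel": data_parallel,
        "grad_accum_steps": global_batch_size // samples_per_pass,
        "weight_shards": model_parallel,  # each GPU holds roughly 1/weight_shards of the weights
    }


# Layout from the failing run: TP=4, PP=2, micro batch 8.
reported = parallel_layout(8, tp=4, pp=2, global_batch_size=256, micro_batch_size=8)
# Suggested change: TP=8, PP=1, micro batch 1.
suggested = parallel_layout(8, tp=8, pp=1, global_batch_size=256, micro_batch_size=1)
# Faster but more memory-hungry option: TP=4, PP=1.
faster = parallel_layout(8, tp=4, pp=1, global_batch_size=256, micro_batch_size=1)

print(reported, suggested, faster, sep="\n")
```

A smaller micro batch keeps the same global batch (more accumulation steps) while shrinking the activations held per forward pass, which is why it is the first thing to try. TP=4, PP=1 halves the number of weight shards, so each GPU holds about twice as much of the model.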

zdaiot · Answer 6 · Wed May 22 2024 12:53:17 GMT+0800 (China Standard Time)

For the OOM during training, can you please try to change your config to have
micro_batch_size: 1
instead of 8? You can set validation interval to be very small, something like 10 and see if it works to have faster reproduction of the issue. Another potential thing is to use TP=8, PP=1, instead of TP=4, PP=2 as you currently have. You might even benefit (speed-wise) from using TP=4, PP=1, but it will use more memory, not less in that regime, so is not likely to help with the current problem. Please let us know if you're still facing OOM after these changes

I am still facing OOM. I am using the main branch:

python pipeline/run_sft_and_eval.py \
   --expname openmath-mistral-7b \
   --nemo_model /dockerdata/zhaodali/.cache/nemo/Mistral-7B-v0.1 \
   --stages sft prepare_eval \
   --num_nodes 1 \
   --num_gpus 8 \
   --disable_wandb \
   ++model.data.train_ds.file_path=/data/sft-data.jsonl \
   ++trainer.sft.max_epochs=1 \
   ++trainer.sft.val_check_interval=200 \
   ++model.tensor_model_parallel_size=8 \
   ++model.pipeline_model_parallel_size=1 \
   ++model.optim.lr=1e-6

The error log is (uploaded as an attached file):

train_200.log

While training, the GPU memory usage is as follows:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB          On  | 00000000:0E:00.0 Off |                    0 |
| N/A   52C    P0             221W / 400W |  31903MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB          On  | 00000000:13:00.0 Off |                    0 |
| N/A   49C    P0             242W / 400W |  32209MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM4-80GB          On  | 00000000:4B:00.0 Off |                    0 |
| N/A   47C    P0             235W / 400W |  32161MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM4-80GB          On  | 00000000:51:00.0 Off |                    0 |
| N/A   55C    P0             244W / 400W |  31669MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A800-SXM4-80GB          On  | 00000000:93:00.0 Off |                    0 |
| N/A   53C    P0             236W / 400W |  32203MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A800-SXM4-80GB          On  | 00000000:99:00.0 Off |                    0 |
| N/A   47C    P0             234W / 400W |  31981MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A800-SXM4-80GB          On  | 00000000:CB:00.0 Off |                    0 |
| N/A   51C    P0             238W / 400W |  32189MiB / 81920MiB |     99%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A800-SXM4-80GB          On  | 00000000:D0:00.0 Off |                    0 |
| N/A   49C    P0             220W / 400W |  31853MiB / 81920MiB |     62%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

There seem to be two errors:

Out of memory.
Not a directory: '/nemo_model/model_config.yaml'. Here /nemo_model/ is a file. Do I need to run tar -xvf Mistral-7B-v0.1 after the checkpoint conversion, and then pass the extracted folder path for SFT?

zdaiot · Answer 7 · Wed May 22 2024 23:05:55 GMT+0800 (China Standard Time)

With v0.1.1 I still have the same issue: out of memory.
train1.log

Can you help me? Thanks a lot.

Igor Gitman · Answer 8 · Thu May 23 2024 02:32:19 GMT+0800 (China Standard Time)

Let me try to reproduce this today. It's always best to untar the .nemo checkpoint (it's just a tar archive), as that saves time when loading it. If you are able to use more than 1 node, that will certainly help, but I think mistral-7b should be trainable on 1 node, so I will double-check on my side.
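
Since a .nemo file is a plain tar archive, unpacking it can also be done from Python. This is a minimal sketch using only the standard library; the untar_nemo_checkpoint helper and the paths in the usage note are hypothetical, not part of NeMo-Skills.

```python
import tarfile
from pathlib import Path


def untar_nemo_checkpoint(nemo_file, output_dir):
    """Extract a .nemo archive into output_dir and return the top-level entry names."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter when this Python version provides it.
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(nemo_file) as archive:  # default mode detects compression
        archive.extractall(output_dir, **extract_kwargs)
    return sorted(path.name for path in output_dir.iterdir())
```

For example, untar_nemo_checkpoint("Mistral-7B-v0.1", "Mistral-7B-v0.1-untarred") should leave model_config.yaml inside the output folder, and that folder is then what gets passed as --nemo_model.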

Igor Gitman · Answer 9 · Thu May 23 2024 11:56:56 GMT+0800 (China Standard Time)

sorry, didn't get to try this today, hopefully will have some time tomorrow. Meanwhile, can you please try to set data.train_ds.max_seq_length=1024 and see if that helps?

zdaiot · Answer 10 · Thu May 23 2024 18:38:51 GMT+0800 (China Standard Time)

sorry, didn't get to try this today, hopefully will have some time tomorrow. Meanwhile, can you please try to set data.train_ds.max_seq_length=1024 and see if that helps?

Thank you for your guidance. I can barely train using data.train_ds.max_seq_length=1024.
But I have observed that at the beginning, training takes up about half of the GPU memory (31881MiB / 81920MiB).
During the first validation, the GPU memory usage increases slightly (37877MiB / 81920MiB).
After the first checkpoint is saved and training resumes, the GPU memory usage immediately doubles (81093MiB / 81920MiB).
After that, I can barely train, and I will continue to observe.

I would like to ask why GPU memory usage doubles. Is this a memory leak? When I run ps -axu | grep python, there are a lot of processes.

python.log
I would also like to ask why you did not add the following CODE_SEPARATORS and CODE_OUTPUT_SEPARATORS as special tokens during training.

CODE_SEPARATORS = ('<llm-code>','</llm-code>') # used to execute code within these tags. 
CODE_OUTPUT_SEPARATORS = ('<llm-code-output>','</llm-code-output>') # used to extract the code output

Currently, a tag such as <llm-code> is converted into several token IDs.

I would also like to ask, what is the difference between the main branch and v0.1.1? Can I use the main branch to reproduce the results?
I would also like to ask why the check text_end.endswith(CODE_SEPARATORS[-1]) is needed here before calling the sandbox.

According to the following training data, the "output" fields do not end with CODE_SEPARATORS[-1].
So won't the data generated by the trained model also not end with CODE_SEPARATORS[-1]?

{
    "question": "Harry has 50 books in his library. His sister Flora has twice as many books and their cousin Gary has half the books Harry has. How many books do the three of them own together?",
    "expected_answer": "175",
    "predicted_answer": "175",
    "error_message": "",
    "is_correct": true,
    "generation_type": "masked_reference_solution",
    "dataset": "gsm8k",
    "input": "System:\nYou're an expert Python programmer and mathematician. Help the user to solve this problem using code when necessary. Make sure to put the answer (and only answer) inside \\boxed{}.\n\nUser:\nHarry has 50 books in his library. His sister Flora has twice as many books and their cousin Gary has half the books Harry has. How many books do the three of them own together?\n\nAssistant:\n",
    "output": "Let's solve this problem using Python code.\n<llm-code>\nlibrary_books = 50\nflora_books = library_books * 2\ngary_books = library_books / 2\ntotal_books = flora_books + gary_books + library_books\ntotal_books\n</llm-code>\n<llm-code-output>\n175.0\n</llm-code-output>\nThus Harry, Flora, and Gary have \\boxed{175} books in total."
}

Why do you train for 4 epochs for Mistral-7B and only 2 epochs for Llama2-70B?
Can I train Mistral-7B for only 2 epochs?

Thanks a lot

Igor Gitman · Answer 11 · Fri May 24 2024 04:59:47 GMT+0800 (China Standard Time)

Ok, if it's training with the new max seq length, it means that you're just running into OOM because it's a single node, and probably the only way to fix it is to run on multiple nodes. But there should not be too many instances in our dataset that are > 1024, so hopefully this won't affect the final results too much.

To answer your questions:

1. It's most likely because you encounter longer sequences later in training, and that's why you see a memory spike. The code might also take more memory during checkpoint saving, although I'm not sure about that. In general, PyTorch will not (by default) return any memory that it allocates (it pre-allocates large chunks and holds on to them to optimize performance), so the memory usage reported through nvidia-smi will only go up, not down. It does not mean that all of that memory is actually in use at any given moment in training.
2. It's certainly an option and will make inference more efficient. We just didn't want to change the tokenizer of the original models, so that's why we didn't do that.
3. I'd recommend sticking to v0.1.1 for now as we are going to keep rapidly changing the main branch in the near future as we add more features and refactor our code. We will make a new v0.2 release when our code stabilizes and we test that everything works properly.
4. CODE_SEPARATORS[-1] is equal to </llm-code> unless you changed something. So that's where we want to stop LLM generation and send the results to the sandbox to get the execution output.
5. Mostly because we didn't have enough compute and time. In general, training for 2 epochs is going to be very close to training for 4 epochs, so you should get something very close to our results if you keep it as 2.
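
The stop-and-execute behaviour described for CODE_SEPARATORS[-1] can be sketched as a small loop. This is an illustration under assumptions, not the nemo_skills implementation: generate_with_code, fake_generate, and fake_execute are hypothetical, with the two fakes standing in for the LLM server (which stops at </llm-code>) and the sandbox.

```python
CODE_SEPARATORS = ("<llm-code>", "</llm-code>")
CODE_OUTPUT_SEPARATORS = ("<llm-code-output>", "</llm-code-output>")


def extract_last_code_block(text):
    """Return the code between the last pair of code separators."""
    start = text.rindex(CODE_SEPARATORS[0]) + len(CODE_SEPARATORS[0])
    end = text.rindex(CODE_SEPARATORS[1])
    return text[start:end].strip()


def generate_with_code(prompt, generate, execute, max_code_executions=6):
    """Alternate between LLM generation and code execution until the answer is done."""
    text, executions = "", 0
    while True:
        text += generate(prompt + text)  # assumed to stop at </llm-code> or at the end
        if not text.endswith(CODE_SEPARATORS[-1]) or executions >= max_code_executions:
            return text  # no pending code block (or budget used up)
        result = execute(extract_last_code_block(text))
        text += f"\n{CODE_OUTPUT_SEPARATORS[0]}\n{result}\n{CODE_OUTPUT_SEPARATORS[1]}\n"
        executions += 1


def fake_generate(text):
    """Stand-in for the LLM: emit code first, then the final sentence."""
    if CODE_OUTPUT_SEPARATORS[1] in text:
        return "Thus Harry, Flora, and Gary have \\boxed{175} books in total."
    return "Let's solve this problem using Python code.\n<llm-code>\n50 + 100 + 25.0\n</llm-code>"


def fake_execute(code):
    """Stand-in for the sandbox: always returns the same output."""
    return "175.0"


solution = generate_with_code("Question...\n", fake_generate, fake_execute)
print(solution)
```

The assembled text has the same shape as the "output" field in the training data: the closing code tag sits in the middle, followed by the execution output and the final answer. The model only ever ends a generation chunk on </llm-code>, while the full solution does not.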

zdaiot · Answer 12 · Fri May 24 2024 15:04:00 GMT+0800 (China Standard Time)

Thanks a lot.

https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/docs/evaluation.md

+prompt_type=code_base \
++prompt.examples_type=gsm8k_text_with_code \
++prompt.num_few_shots=5

It seems that it should be

+prompt=code_base \
++prompt.examples_type=gsm8k_text_with_code \
++prompt.num_few_shots=5

Why didn't you want to change the tokenizer of the original models? What is the underlying reason here?
The Mistral-7B model was only trained for 2 epochs, and the performance is not very good, with only 67.7% accuracy on gsm8k. I will continue training for two more epochs to see if it improves.
I want to know if this project supports LoRA training. Do I just need to change peft.peft_scheme to lora?
May I ask if you have any documentation on PyTorch memory allocation strategies? I would like to delve deeper into it.

Igor Gitman · Answer 13 · Sat May 25 2024 01:18:22 GMT+0800 (China Standard Time)

1. You're right, thanks for catching that!
2. Tbh, we just didn't explore that, so I'm not sure if it's easy or hard to do. If the tokenizer already has some extra tokens that are not used, it should be very easy to just set our code tokens to those extra ones. But if it does not have any untrained tokens, you'd need to change the embedding dimension and that is more involved and might break some compatibility with other packages.
3. Is that on the validation or test set? That does not look right - are you changing any other parameters/code or just running our provided commands directly, but on a single node?
4. We never tried it, but it should work. Please let us know if you run into any errors.
5. This is probably a good start https://pytorch.org/docs/stable/notes/cuda.html#memory-management and you can follow the links there to get more in-depth information.
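
As a rough mental model for that page, the toy class below mimics the distinction the OOM message draws between memory "allocated by PyTorch" and memory "reserved by PyTorch but unallocated". It is a deliberately simplified, hypothetical model (no block sizes, no fragmentation), not PyTorch's actual allocator. In a real run the corresponding numbers come from torch.cuda.memory_allocated() and torch.cuda.memory_reserved().

```python
class ToyCachingAllocator:
    """Toy model: freed memory goes back to a cache instead of back to the driver."""

    def __init__(self):
        self.allocated = 0  # GiB held by live tensors
        self.cached = 0     # GiB freed by tensors but kept for reuse

    @property
    def reserved(self):
        """What an outside tool such as nvidia-smi would roughly attribute to the process."""
        return self.allocated + self.cached

    def malloc(self, size):
        reuse = min(size, self.cached)
        self.cached -= reuse     # serve the request from the cache first
        self.allocated += size   # anything beyond `reuse` is newly requested memory

    def free(self, size):
        self.allocated -= size
        self.cached += size      # kept in the cache, so `reserved` does not drop


gpu = ToyCachingAllocator()
gpu.malloc(30)                                 # steady-state training
gpu.malloc(40)                                 # temporary spike, e.g. a long sequence
gpu.free(40)
after_spike = (gpu.allocated, gpu.reserved)    # live usage drops, reserved stays at the peak
gpu.malloc(20)
after_reuse = (gpu.allocated, gpu.reserved)    # served from the cache, reserved unchanged
print(after_spike, after_reuse)
```

In this model the reserved figure only ever records the peak, which matches the observation that usage reported by nvidia-smi jumps once and then stays high without indicating a leak.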

zdaiot · Answer 14 · Sat May 25 2024 12:17:06 GMT+0800 (China Standard Time)

You're right, thanks for catching that!

Tbh, we just didn't explore that, so I'm not sure if it's easy or hard to do. If the tokenizer already has some extra tokens that are not used, it should be very easy to just set our code tokens to those extra ones. But if it does not have any untrained tokens, you'd need to change the embedding dimension and that is more involved and might break some compatibility with other packages.

Is that on the validation or test set? That does not look right - are you changing any other parameters/code or just running our provided commands directly, but on a single node?

We never tried it, but it should work. Please let us know if you run into any errors.

This is probably a good start https://pytorch.org/docs/stable/notes/cuda.html#memory-management and you can follow the links there to get more in-depth information.

I used the first and second steps here to prepare the data, and step 3 to convert the model.

I used the following command for training, with data.train_ds.max_seq_length=1024, on a single 8x80GB A100 node.

python pipeline/run_sft_and_eval.py \
   --expname openmath-mistral-7b \
   --nemo_model /dockerdata/zhaodali/.cache/nemo_old/Mistral-7B-v0.1-untarred \
   --stages sft prepare_eval \
   --num_nodes 1 \
   --num_gpus 8 \
   --disable_wandb \
   ++model.data.train_ds.file_path=/data/sft-data.jsonl \
   ++trainer.sft.max_epochs=4 \
   ++trainer.sft.val_check_interval=600 \
   ++model.tensor_model_parallel_size=8 \
   ++model.pipeline_model_parallel_size=1 \
   ++model.optim.lr=1e-6

The final checkpoint files are as follows:

'megatron_gpt_sft--val_code_generation_accuracy=0.929-step=3000-consumed_samples=384000-epoch=1'
'megatron_gpt_sft--val_code_generation_accuracy=0.940-step=2400-consumed_samples=307200-epoch=1'
'megatron_gpt_sft--val_code_generation_accuracy=0.943-step=3600-consumed_samples=460800-epoch=2'
'megatron_gpt_sft--val_code_generation_accuracy=0.949-step=4800-consumed_samples=614400-epoch=3'
'megatron_gpt_sft--val_code_generation_accuracy=0.950-step=4200-consumed_samples=537600-epoch=2'
'megatron_gpt_sft--val_code_generation_accuracy=0.956-step=5400-consumed_samples=691200-epoch=3'
'megatron_gpt_sft--val_code_generation_accuracy=0.958-step=6344-consumed_samples=812032-epoch=4'
'megatron_gpt_sft--val_code_generation_accuracy=0.958-step=6344-consumed_samples=812032-epoch=4-last'
'megatron_gpt_sft--val_code_generation_accuracy=0.959-step=6000-consumed_samples=768000-epoch=3'

The command used in the final test is

python pipeline/run_eval.py \
  --model_path /dockerdata/zhaodali/.cache/nemo_old/nemo_skills_results/nemo-skills-exps/checkpoints/openmath-mistral-7b/openmath-mistral-7b.nemo \
  --server_type nemo \
  --output_dir `pwd`/openmath-mistral-7b-eval-results \
  --benchmarks gsm8k:0 \
  --num_gpus 8 \
  --num_nodes 1 \
  +prompt=code_sfted \
  ++prompt.num_few_shots=0 \
  ++split_name=test \
  ++server.max_code_executions=6 \
  ++server.stop_on_code_error=False \
  ++batch_size=64

python pipeline/summarize_results.py `pwd`/openmath-mistral-7b-eval-results

After training for 4 epochs, the results are as follows (I also tried averaging the checkpoints from different steps, and the results were not much different):

benchmark,decoding,num_entries,correct_answer,wrong_answer,no_answer
gsm8k,greedy,1319,70.20,28.96,0.83

I also tested the weights you released (OpenMath-Mistral-7B); the result is:

benchmark,decoding,num_entries,correct_answer,wrong_answer,no_answer
gsm8k,greedy,1319,79.91,19.26,0.83

Is this because of the change to data.train_ds.max_seq_length=1024?
But it seems that only a few samples are longer than 1024. Is there a parameter that simply discards training samples longer than 1024?
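
Independently of whether such a parameter exists, over-long samples can be dropped from the JSONL file before training. The sketch below is an assumption-laden illustration: filter_by_length is a hypothetical helper, it relies on the "input" and "output" fields shown in the training sample earlier in this thread, and the whitespace-based token count is only a stand-in for the model's real tokenizer.

```python
import json


def whitespace_token_count(text):
    """Stand-in tokenizer: counts whitespace-separated words, not real model tokens."""
    return len(text.split())


def filter_by_length(lines, max_seq_length, count_tokens=whitespace_token_count):
    """Keep only JSONL lines whose input + output fit into max_seq_length tokens."""
    kept = []
    for line in lines:
        sample = json.loads(line)
        if count_tokens(sample["input"] + sample["output"]) <= max_seq_length:
            kept.append(line)
    return kept


demo_lines = [
    json.dumps({"input": "User: what is 2 + 3? Assistant: ", "output": "It is 5."}),
    json.dumps({"input": "User: " + "very " * 50 + "long question. Assistant: ", "output": "..."}),
]
short_only = filter_by_length(demo_lines, max_seq_length=20)
print(len(demo_lines), "->", len(short_only))
```

Swapping whitespace_token_count for a function that calls the actual tokenizer would make the counts comparable to max_seq_length.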

Thanks a lot

Igor Gitman · Answer 15 · Sat May 25 2024 12:36:37 GMT+0800 (China Standard Time)

Somehow your epoch count does not match the size of the dataset. Can you please check how many elements are in sft-data.jsonl? It should be 1024000, and the global batch size is 128, so 1024000 / 128 = 8000. So it should be 8000 steps per epoch, and you're only getting half of that if not less.
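
The mismatch can be checked with a few lines of arithmetic on the numbers embedded in the last checkpoint name (step=6344, consumed_samples=812032, epoch=4). The helper names below are hypothetical, and count_jsonl_records is just one simple way to count the entries in sft-data.jsonl.

```python
def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def steps_per_epoch(num_samples, global_batch_size):
    return num_samples // global_batch_size


expected_steps = steps_per_epoch(1_024_000, 128)

# Values read off "...step=6344-consumed_samples=812032-epoch=4-last".
final_step, consumed_samples, epochs = 6344, 812_032, 4
observed_batch_size = consumed_samples // final_step
observed_steps_per_epoch = final_step // epochs
implied_dataset_size = observed_steps_per_epoch * observed_batch_size

print(expected_steps, observed_batch_size, observed_steps_per_epoch, implied_dataset_size)
```

The observed batch size matches 128, but each epoch covers only 1586 steps, which implies roughly 203,000 training samples instead of 1,024,000. Comparing count_jsonl_records("/data/sft-data.jsonl") against 1024000 would confirm whether the file is incomplete.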

Igor Gitman · Answer 16 · Sat May 25 2024 12:38:21 GMT+0800 (China Standard Time)

By the way, these parameters

  ++server.max_code_executions=6 \
  ++server.stop_on_code_error=False \

are only supported for the trtllm model; for the nemo model (in v0.1.1) they are simply ignored. But they only make a difference of 1-2%, not more, so there is certainly some bigger issue here.

Igor Gitman · Answer 17 · Sat May 25 2024 12:46:35 GMT+0800 (China Standard Time)

Let me also try to run the same commands as you're doing on my side to see if I can reproduce this regression. I'd be really surprised if it's caused by the max_seq_length, so want to understand what's going on here

Igor Gitman · Answer 18 · Sat May 25 2024 12:55:57 GMT+0800 (China Standard Time)

Another question is why you set val_check_interval=600. This causes you to save too many checkpoints, and we only keep 8, so they will not be equally spaced. We found that it's generally best to save 4-8 checkpoints at equal intervals and then average all of them.
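
One way to pick an interval that yields equally spaced checkpoints is to divide the total number of training steps by the number of checkpoints to keep. The two helpers below are hypothetical and assume that a checkpoint is written at every multiple of the interval, which is what the checkpoint names earlier in this thread suggest.

```python
import math


def val_check_interval(total_steps, num_checkpoints):
    """Smallest interval that produces at most num_checkpoints checkpoints."""
    return math.ceil(total_steps / num_checkpoints)


def checkpoint_steps(total_steps, interval):
    """Steps at which a checkpoint would be written for a given interval."""
    return list(range(interval, total_steps + 1, interval))


total_steps = 6344                                    # final step of the 4-epoch run above
with_600 = checkpoint_steps(total_steps, 600)         # more checkpoints than the 8 that are kept
interval = val_check_interval(total_steps, 8)
equally_spaced = checkpoint_steps(total_steps, interval)

print(len(with_600), interval, equally_spaced)
```

With an interval of 600 the run produces ten regular checkpoints plus the final one, so some have to be discarded. Dividing the run into eight equal parts keeps every checkpoint and makes the final step one of them.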

zdaiot · Answer 19 · Sat May 25 2024 14:28:35 GMT+0800 (China Standard Time)

another question is why do you set val_check_interval=600? This causes you to save too many checkpoints and we only keep 8, so they will not be equally spaced. We found that it's generally best to save at equal intervals 4-8 checkpoints and then average all of them

Somehow your epoch count is not matching the size of the dataset. Can you please check how many elements are in the sft-data.jsonl? It should be 1024000 and then the global batch size is 128, so 1024000 / 128 = 8000. So it should be 8000 samples per epoch and you're only getting half of that if not less

Thank you very much. My open-math-instruct-1/sft-data.jsonl file was corrupted (which is the reason I used val_check_interval=600), and I have regenerated it. But it will take about 10 days to complete the training, so I may first experiment on a small dataset.

I would like to ask: during training, model.data.validation_ds.file_path is datasets/gsm8k/train_sft.jsonl, but OpenMathInstruct-1 should have been generated using datasets/gsm8k/train_full.jsonl, so the SFT training set seems to contain the validation_ds set.
Can you open-source the scripts that go from synthetic-solutions to open-math-instruct-1?
Once I have generated the data, do I just need to put all the jsonl file paths from the synthetic-solutions folder into the preprocessed_dataset_files parameter of the following command?

python nemo_skills/finetuning/prepare_sft_data.py \
    ++preprocessed_dataset_files="xxxx" \
    ++output_path=$NEMO_SKILLS_DATA/sft-data.jsonl \
    ++downsampling_method=fair \
    ++num_output_samples=1024000 \
    ++text_filter_type=any_code \
    ++trim_solutions=True

Igor Gitman · Answer 20 · Sat May 25 2024 20:52:08 GMT+0800 (China Standard Time)

I would like to ask: during the training process, model.data.validation_ds.file_path is datasets/gsm8k/train_sft.jsonl, but OpenMathInstruct-1 should be generated using datasets/gsm8k/train_full.jsonl, so the sft training set seems to contain the validation_ds set.

That's right, for the final model training round we combine both "train" and "validation" subsets to be able to compare to prior works (as gsm8k does not have a standard validation split and other people use the full training data to run their experiments). But if you're going to do research and run hyper-parameter searches (and only need to compare to your own baseline), we'd recommend you only use the "train" subset and evaluate on the "validation" subset.

If you're running on the train_full subset and want to make things a bit faster, you can also set trainer.sft.limit_val_batches=1 (integer value) to only run validation on 1 batch, as those numbers are not meaningful anyway. Also set the validation batch size to 8 (it should be at least the number of GPUs, I think) and potentially decrease the maximum sequence length inside model.inference.length_params.max_length. That will ensure that your validation does not slow down training anymore (I don't think you can fully disable it, so you just need to make sure it's very fast).

Can you open scripts from synthetic-solutions to open-math-instruct-1?
Once I have generated the data, do I just need to pass all the jsonl file paths in the synthetic-solutions folder to the preprocessed_dataset_files parameter in the following command?

I don't fully understand the question. But if you generate new data, you should use the prediction_jsonl_files parameter instead of preprocessed_dataset_files. You can then pass the generated output-rs*.jsonl files to that script and it will prepare the data for sft. I guess we missed that part in the docs - will add it somewhere.

If you want to use a subset you can potentially only limit yourself to gsm8k, so manually prepare a subset with "dataset": "gsm8k" and set num_output_samples to 256000 or even 128000. For 256K I think you should be getting within 1-2% of our best results for gsm8k. For 128K the regression might be larger, but shouldn't be more than 5% or so. You can also try to set tensor_parallel_size to be 4 or even 2 to make the job run faster, but I'm not fully sure about the effect of that.
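A minimal sketch of the manual subset step, assuming the file is JSON Lines with one object per line and a "dataset" key as described above. The function name filter_jsonl_by_dataset and the file paths in the usage comment are hypothetical; this is not a script from the repository.

```python
import json


def filter_jsonl_by_dataset(input_path, output_path, dataset_name="gsm8k"):
    """Copy only the records whose "dataset" field equals dataset_name.

    Returns the number of records written. Blank lines are skipped.
    """
    kept = 0
    with open(input_path, encoding="utf-8") as fin, \
            open(output_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("dataset") == dataset_name:
                fout.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept


# Hypothetical usage (paths are placeholders):
# kept = filter_jsonl_by_dataset("all-solutions.jsonl", "gsm8k-only.jsonl")
# print(f"kept {kept} gsm8k records")
```

The filtered file can then be fed to the data preparation step with a smaller num_output_samples, as suggested above.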

zdaiot · Answer 21 · Mon May 27 2024 11:10:20 GMT+0800 (China Standard Time)

Thank you very much. I found that by using only the 61K "dataset": "gsm8k" samples, the accuracy of Mistral-7B can reach 74.53%.

Also, I wanted to ask: have you ever encountered "Error: no such file /tmp/server_logs.txt" when executing the command below? I think it happens because tail -n0 -f /tmp/server_logs.txt | sed '/Running on all addresses/ q' is executed immediately after {server_start_cmd} starts running, before the log file exists.

export PYTHONPATH=$PYTHONPATH:/code && \
{server_start_cmd} && \
if [ $SLURM_LOCALID -eq 0 ]; then \
    echo "Waiting for the server to start" && \
    tail -n0 -f /tmp/server_logs.txt | sed '/Running on all addresses/ q' && \

My solution is:

export PYTHONPATH=$PYTHONPATH:/code && \
{server_start_cmd} && sleep 10 && \
if [ $SLURM_LOCALID -eq 0 ]; then \
    echo "Waiting for the server to start" && \
    tail -f /tmp/server_logs.txt | sed '/Running on all addresses/ q' && \

Igor Gitman · Answer 22 · Tue May 28 2024 01:43:16 GMT+0800 (China Standard Time)

Great to hear you're getting some good results! I also double checked that max_seq_length=1024 shouldn't affect the final results significantly. After using it and running for 1 epoch on the full dataset, I was able to get the following (using nemo eval):

Running compute_metrics.py for math
2024-05-27 10:37:57 INFO  Greedy results
2024-05-27 10:37:58 INFO  Evaluation results for ['../experiments/nemo-skills-exps/results/openmath-mistral-7b-repro/math/output-greedy.jsonl']
2024-05-27 10:37:58 INFO  Total eval entries: 5000
2024-05-27 10:37:58 INFO  Correct answer: 41.16%
2024-05-27 10:37:58 INFO  Wrong answer: 41.14%
2024-05-27 10:37:58 INFO  No answer: 17.70%
2024-05-27 10:37:58 INFO  Running compute_metrics.py for gsm8k
2024-05-27 10:37:58 INFO  Greedy results
2024-05-27 10:37:58 INFO  Evaluation results for ['../experiments/nemo-skills-exps/results/openmath-mistral-7b-repro/gsm8k/output-greedy.jsonl']
2024-05-27 10:37:58 INFO  Total eval entries: 1319
2024-05-27 10:37:58 INFO  Correct answer: 79.08%
2024-05-27 10:37:58 INFO  Wrong answer: 19.64%
2024-05-27 10:37:58 INFO  No answer: 1.29%
2024-05-27 10:37:58 INFO  benchmark,decoding,num_entries,correct_answer,wrong_answer,no_answer
2024-05-27 10:37:58 INFO  math,greedy,5000,41.16,41.14,17.70
2024-05-27 10:37:58 INFO  gsm8k,greedy,1319,79.08,19.64,1.29

By the way, if you can identify a "smart" small subset of our dataset that gives similar results to using full data, that will be a great research contribution. We think that by smartly selecting a subset of solutions (not just randomly), it should be possible to significantly reduce the data size without any impact on the final results, but we don't have any experiments to demonstrate that yet.

Igor Gitman · Answer 23 · Tue May 28 2024 01:45:12 GMT+0800 (China Standard Time)

We never saw the error you're mentioning, but if sleeping for a few seconds works, that's great. It probably means that your filesystem is somehow slow, since the output file should be created immediately after the server command is launched. Let me add a sleep for a few seconds to our code, so that other people don't face the same issue.
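If a fixed sleep ever turns out to be too short, one alternative is to poll until the log file exists and only then look for the startup line. Below is a minimal sketch of that idea in Python. The path and the "Running on all addresses" marker come from the commands above; the function name wait_for_log_line, the timeout, and the polling interval are assumptions for illustration and are not what the launcher does.

```python
import os
import time


def wait_for_log_line(path, marker, timeout=600.0, poll_interval=0.5):
    """Block until `marker` appears in the file at `path`, then return that line.

    Polls for the file to exist first, so a slow filesystem does not cause
    a "no such file" error. Raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f"{path} was not created within {timeout} seconds")
        time.sleep(poll_interval)

    buffer = ""
    # Read from the start of the file so a line written early is not missed.
    with open(path, encoding="utf-8", errors="replace") as log_file:
        while True:
            chunk = log_file.readline()
            if chunk:
                buffer += chunk
                if buffer.endswith("\n"):  # only inspect complete lines
                    if marker in buffer:
                        return buffer.rstrip("\n")
                    buffer = ""
                continue
            if time.monotonic() > deadline:
                raise TimeoutError(f"'{marker}' not found in {path} within {timeout} seconds")
            time.sleep(poll_interval)


# Hypothetical usage:
# wait_for_log_line("/tmp/server_logs.txt", "Running on all addresses")
```

Unlike tail -n0 -f, this reads the file from the beginning, so it also works when the server prints the startup line before the wait begins.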

Igor Gitman · Answer 24 · Thu May 30 2024 01:35:25 GMT+0800 (China Standard Time)

Let me close this issue as I think all of the questions are resolved. Please feel free to open a new issue/discussion if any more problems arise.

zdaiot · Answer 25 · Thu May 30 2024 10:35:36 GMT+0800 (China Standard Time)

Thanks a lot

zdaiot · Answer 26 · Mon Jun 03 2024 17:03:24 GMT+0800 (China Standard Time)

@Kipok
Hello, I have a few more questions to ask.

During SFT training: Is inference.strategy only used in the validation phase? That is to say, during the training phase, the model will predict llm-code-output and also calculate the loss. But in the validation phase, llm-code-output is generated with the help of the sandbox.
When validating on the test set, the following logs are generated. Which function prints them?

[pid: 12|app: 0|req: 94/115] 172.17.0.5 () {40 vars in 524 bytes} [Mon Jun  3 07:40:41 2024] PUT /execute_code => generated 302 bytes in 36 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 0)
172.17.0.5 - - [03/Jun/2024:07:40:41 +0000] "PUT /execute_code HTTP/1.1" 200 302 "-" "python-requests/2.31.0" "-"

After SFT, how can I test the performance of the model without the help of the sandbox?
When validating on the test set, are the results of nemo and tensorrt-llm the same? I see https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/docs/evaluation.md uses nemo, but https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/docs/reproducing-results.md uses tensorrt-llm.
Why is kwargs['handle_code_execution'] = False set when using nemo? https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/nemo_skills/inference/server/model.py#L262

Thanks again

Igor Gitman · Answer 27 · Tue Jun 04 2024 03:08:39 GMT+0800 (China Standard Time)

That's right, the model will be trained on the code outputs in the same way as on any other tokens. We didn't really check if it's helpful or harmful. It is possible to change this, but we don't directly support it in our code. During testing/validation the code output is always added via sandbox.
This is printed from the sandbox server (default flask logs). If you don't want to see sandbox logs, you can disable them with something like this: https://stackoverflow.com/questions/14888799/disable-console-messages-in-flask-server, applied here: https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/nemo_skills/code_execution/local_sandbox/local_sandbox_server.py#L28
I'm not sure why you would want to do that. If you run it without the sandbox, it will still generate llm-code tokens, but instead of seeing the ground-truth output, it will just hallucinate the llm-code-output tokens and everything in between them. So it will most likely perform quite poorly. If you still want to do that, remove the inference strategy from here https://github.com/Kipok/NeMo-Skills/blob/v0.1.1/nemo_skills/inference/server/serve_nemo.py#L137 and from line 144 there as well. Or set handle_code_execution to False for trtllm if that's what you're using to eval.
In the eval docs nemo is just used as an example, you can change it to trtllm provided you convert the checkpoint first. The results will not be the same, but should be within 1-2% difference from each other. Trtllm is just way faster and also supports continuing execution after code errors, so that's why we used it for our final evaluations.
That's because our code execution implementation for nemo is done directly inside the generation code. This means that we don't need to make a second call after getting sandbox execution results and can re-use the kv-cache of the existing generation. Basically, you can set it to True if you want and you would get the same results, just much slower.
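To illustrate the difference, here is a rough sketch of what a client-side code execution loop with a "second call" looks like: generate until the code block closes, run the code, append the real output, then call the model again. This is only an illustration, not the NeMo-Skills implementation. The tag strings, the generate_fn/execute_fn callables, and the fake model and sandbox used for the demo are all assumptions.

```python
# Assumed tag strings for code and code-output blocks.
CODE_BEGIN, CODE_END = "<llm-code>", "</llm-code>"
OUTPUT_BEGIN, OUTPUT_END = "<llm-code-output>", "</llm-code-output>"


def generate_with_sandbox(prompt, generate_fn, execute_fn, max_code_blocks=3):
    """Alternate between model generation and sandbox execution.

    generate_fn(text, stop) -> completion that ends with `stop` if it was hit.
    execute_fn(code) -> string with the real execution output.
    Both are hypothetical callables supplied by the caller.
    """
    generated = ""
    for _ in range(max_code_blocks + 1):
        # Each iteration is a fresh call on the full text so far,
        # which is the extra call mentioned above.
        completion = generate_fn(prompt + generated, stop=CODE_END)
        generated += completion
        if not completion.endswith(CODE_END) or CODE_BEGIN not in generated:
            break  # the model finished without opening another code block
        code = generated.rsplit(CODE_BEGIN, 1)[1][: -len(CODE_END)]
        result = execute_fn(code.strip())
        # Insert the real output instead of letting the model hallucinate it.
        generated += f"\n{OUTPUT_BEGIN}\n{result}\n{OUTPUT_END}\n"
    return generated


def fake_generate(text, stop):
    """Stand-in for a model: writes code first, then a final answer."""
    if OUTPUT_END not in text:
        return f"Let's compute it.\n{CODE_BEGIN}\nprint(2 + 3)\n{CODE_END}"
    return "So the answer is 5."


def fake_execute(code):
    """Stand-in for the sandbox: pretends to run the code."""
    return "5"


solution = generate_with_sandbox("What is 2 + 3?\n", fake_generate, fake_execute)
print(solution)
```

Without the execute step, the model would have to produce the text between the output tags itself, which is the hallucination case described above.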