NVIDIA / Megatron-LM

Ongoing research training transformer models at scale

Home Page: https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Missing init_process_group call when converting model from HF format to mcore.

benoriol opened this issue · comments

commented

Describe the bug
Getting the following error when trying to convert the Mistral 7B model from HF format to mcore, following the instructions in docs/llama_mistral.md:

ValueError: Default process group has not been initialized, please make sure to call init_process_group.

To Reproduce
I am following the instructions in docs/llama_mistral.md step by step:

  1. First download the Mistral weights from Hugging Face.
  2. Install mistral-common.
  3. Here is where I get the error. Run python tools/checkpoint/convert.py with the same arguments described in the tutorial: python tools/checkpoint/convert.py --model-type GPT --loader llama_mistral --saver mcore --target-tensor-parallel-size 4 --checkpoint-type hf --load-dir /workspace/checkpoints/Mistral-7B-v0.3/ --save-dir /workspace/checkpoints/Mistral-7B-v0.3-Megatron --tokenizer-model /workspace/checkpoints/Mistral-7B-v0.3/tokenizer.model --model-size mistral-7B

Expected behavior
The conversion script should run to completion and produce the model in mcore format.

Stack trace/logs

root@9d4cb397ecd1:/workspace/megatron# python tools/checkpoint/convert.py --model-type GPT --loader llama_mistral --saver mcore --target-tensor-parallel-size 4 --checkpoint-type hf --load-dir /workspace/checkpoints/Mistral-7B-v0.3/ --save-dir /workspace/checkpoints/Mistral-7B-v0.3-Megatron --tokenizer-model /workspace/checkpoints/Mistral-7B-v0.3/tokenizer.model --model-size mistral-7B
Loaded loader_llama_mistral as the loader.
Loaded saver_mcore as the saver.
Starting saver...
Starting loader...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:26<00:00,  8.80s/it]
building GPT model ...
set layer states: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:03<00:00,  9.11it/s]
sending embeddings
Overwriting default ffn_hidden_size value None with value from checkpoint 14336.
Overwriting default kv_channels value None with value from checkpoint 128.
Overwriting default group_query_attention value False with value from checkpoint True.
Overwriting default num_query_groups value 1 with value from checkpoint 8.
Overwriting default use_rotary_position_embeddings value False with value from checkpoint True.
Overwriting default add_position_embedding value True with value from checkpoint False.
Overwriting default normalization value LayerNorm with value from checkpoint RMSNorm.
Overwriting default swiglu value False with value from checkpoint True.
Overwriting default global_batch_size value None with value from checkpoint 1024.
Overwriting default dataloader_type value None with value from checkpoint single.
Overwriting default use_legacy_models value False with value from checkpoint True.
Overwriting default load value None with value from checkpoint /workspace/checkpoints/Mistral-7B-v0.3/.
Overwriting default overlap_p2p_comm value True with value from checkpoint False.
Overwriting default vocab_size value None with value from checkpoint 32768.
Overwriting default transformer_impl value transformer_engine with value from checkpoint local.
Checkpoint had argument iteration but new arguments does not have this.
Checkpoint had argument padded_vocab_size but new arguments does not have this.
Checkpoint had argument transformer_pipeline_model_parallel_size but new arguments does not have this.
Checkpoint had argument data_parallel_size but new arguments does not have this.
Checkpoint had argument consumed_train_samples but new arguments does not have this.
Checkpoint had argument consumed_valid_samples but new arguments does not have this.
Checkpoint had argument variable_seq_lengths but new arguments does not have this.
Checkpoint had argument disable_bias_linear but new arguments does not have this.
Checkpoint had argument model_type but new arguments does not have this.
Checkpoint had argument model_size but new arguments does not have this.
Setting consumed_train_samples to 0 and consumed_valid_samples to 0
sending transformer layer 0
sending transformer layer 1
sending transformer layer 2
sending transformer layer 3
sending transformer layer 4
received embeddings
Original vocab size not specified, leaving embedding table as-is. If you've changed the tensor parallel size this could cause problems.
building GPT model ...
sending transformer layer 5
sending transformer layer 6
sending transformer layer 7
sending transformer layer 8
sending transformer layer 9
sending transformer layer 10
/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py:607: UserWarning: To guarantee overlapping TP and SP collectives with the backwardGEMMs, set environment variable CUDA_DEVICE_MAX_CONNECTIONS = 1
  warnings.warn(
sending transformer layer 11
> memory usage: 'saver', rank 0 / 4, mem 0.6/1121.8 gb.
building GPT model ...
sending transformer layer 12
> memory usage: 'saver', rank 1 / 4, mem 0.6/1121.8 gb.
building GPT model ...
sending transformer layer 13
sending transformer layer 14
sending transformer layer 15
> memory usage: 'saver', rank 2 / 4, mem 0.6/1121.8 gb.
building GPT model ...
sending transformer layer 16
> memory usage: 'saver', rank 3 / 4, mem 0.6/1121.8 gb.
sending transformer layer 17
sending transformer layer 18
sending transformer layer 19
received transformer layer 0
sending transformer layer 20
received transformer layer 1
sending transformer layer 21
sending transformer layer 22
sending transformer layer 23
sending transformer layer 24
received transformer layer 2
sending transformer layer 25
sending transformer layer 26
sending transformer layer 27
sending transformer layer 28
sending transformer layer 29
sending transformer layer 30
received transformer layer 3
sending transformer layer 31
sending final norm
sending output layer
received transformer layer 4
received transformer layer 5
received transformer layer 6
Waiting for saver to complete...
received transformer layer 7
received transformer layer 8
received transformer layer 9
received transformer layer 10
received transformer layer 11
received transformer layer 12
received transformer layer 13
received transformer layer 14
received transformer layer 15
received transformer layer 16
received transformer layer 17
received transformer layer 18
received transformer layer 19
received transformer layer 20
received transformer layer 21
received transformer layer 22
received transformer layer 23
received transformer layer 24
received transformer layer 25
received transformer layer 26
received transformer layer 27
received transformer layer 28
received transformer layer 29
received transformer layer 30
received transformer layer 31
received final norm
received output layer
saving checkpoint at iteration       1 to /workspace/checkpoints/Mistral-7B-v0.3-Megatron in torch format
  successfully saved checkpoint from iteration       1 to /workspace/checkpoints/Mistral-7B-v0.3-Megatron
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/megatron/tools/checkpoint/saver_mcore.py", line 669, in save_checkpoint
    save_checkpoint(md.iteration, [models[tp_rank]], None, None,
  File "/workspace/megatron/megatron/training/checkpointing.py", line 410, in save_checkpoint
    logger.debug(f"rank: {torch.distributed.get_rank()}, takes {end_misc - start_misc} to finalize ckpt save ")
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1633, in get_rank
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 985, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
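
For reference, the final ValueError is plain PyTorch behavior: any call to torch.distributed.get_rank() before init_process_group() fails this way. A minimal standalone sketch (illustrative only, outside Megatron-LM, on PyTorch 2.3):

# Reproduces the underlying error without Megatron-LM.
import torch.distributed as dist

try:
    dist.get_rank()  # no init_process_group() has been called
except ValueError as err:
    print(err)  # "Default process group has not been initialized, ..."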

Environment (please complete the following information):

  • Megatron-LM commit ID: df61e60
  • PyTorch version: '2.3.0a0+ebedce2'
  • CUDA version: '12.3'
  • NCCL version: (2, 20, 3)

Proposed fix
Initialize torch.distributed at some point in the script.
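
For example (a minimal sketch assuming a single-process run on one node; this is not the fix that was eventually merged), the script could create a trivial one-rank default group before saving:

# Sketch: initialize a one-rank default process group so that
# torch.distributed.get_rank() works in a single-process tool.
# The MASTER_ADDR/MASTER_PORT values are arbitrary local defaults.
import os
import torch.distributed as dist

if dist.is_available() and not dist.is_initialized():
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)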

Additional context
I am running on a single node with 8 A100 GPUs; I don't plan on using more than one node.
I am using the current main branch.

Started docker with
docker run --gpus all -it --rm -v /my/path/to/Megatron-LM:/workspace/megatron -v /my/path/to/data:/workspace/dataset -v /my/path/to/checkpoints:/workspace/checkpoints -w /workspace/megatron/examples/multimodal --shm-size 6G megatron:multimodal
The image was built from the Dockerfile in Megatron-LM/examples/multimodal, since I am planning to use this setup to train multimodal LLaVA as described in examples/multimodal.

My WORLD_SIZE, RANK, MASTER_ADDR, and MASTER_PORT environment variables are unset. I am not sure whether any other environment variables could affect this.

From outside the docker:

$ docker network ls
NETWORK ID     NAME         DRIVER    SCOPE
f1584b1ae327   bridge       bridge    local
1ce4f004699d   host         host      local
06c0830e8909   my_network   bridge    local
ee252c0411d4   none         null      local
commented

Pseudo-solved by checking out the following commit: 86850db

I have already updated to 0bc3547, but encountered the same problem when converting the Llama 3 70B model.

commented

I meant I had to downgrade to that commit via git checkout 86850db930c85ed925e661574acc7564debf7988

Pseudo-solved by checking out the following commit: 86850db

temporarily solved the problem...

@jon-barker Can you take a look at this, please?

This is a regression. An update to the checkpointing code last week made the incorrect assumption that you'd always be in a distributed setting when saving checkpoints. We'll make a fix internally and push it out ASAP.

A short-term workaround (WAR) is to comment out the logger.debug lines in megatron/training/checkpointing.py.
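
Equivalently, the calls can be made conditional instead of deleted; a minimal sketch of that guard (debug_rank is a hypothetical helper, not existing Megatron-LM code, and not the actual patch):

import logging
import torch

logger = logging.getLogger(__name__)

def debug_rank(msg: str) -> None:
    # Hypothetical helper: only query the rank when a default process
    # group exists, so single-process tools do not crash here.
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        logger.debug(f"rank: {torch.distributed.get_rank()}, {msg}")
    else:
        logger.debug(msg)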

We have implemented a fix internally and it will be pushed out to GitHub later this week.

Now fixed in main