[BUG] Missing init_process_group call when converting model to HF format.
benoriol opened this issue
Describe the bug
I get the following error when trying to convert the Mistral 7B model from HF format to mcore, following the instructions in docs/llama_mistral.md:
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
To Reproduce
I am following the instructions in docs/llama_mistral.md step by step:
- First, download the Mistral weights from Hugging Face.
- Install mistral-commons.
- Here is where I get the error. Run the command python tools/checkpoint/convert.py with the same arguments described in the tutorial:
python tools/checkpoint/convert.py --model-type GPT --loader llama_mistral --saver mcore --target-tensor-parallel-size 4 --checkpoint-type hf --load-dir /workspace/checkpoints/Mistral-7B-v0.3/ --save-dir /workspace/checkpoints/Mistral-7B-v0.3-Megatron --tokenizer-model /workspace/checkpoints/Mistral-7B-v0.3/tokenizer.model --model-size mistral-7B
Expected behavior
The conversion script should run to completion and produce the model in mcore format.
Stack trace/logs
root@9d4cb397ecd1:/workspace/megatron# python tools/checkpoint/convert.py --model-type GPT --loader llama_mistral --saver mcore --target-tensor-parallel-size 4 --checkpoint-type hf --load-dir /workspace/checkpoints/Mistral-7B-v0.3/ --save-dir /workspace/checkpoints/Mistral-7B-v0.3-Megatron --tokenizer-model /workspace/checkpoints/Mistral-7B-v0.3/tokenizer.model --model-size mistral-7B
Loaded loader_llama_mistral as the loader.
Loaded saver_mcore as the saver.
Starting saver...
Starting loader...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:26<00:00, 8.80s/it]
building GPT model ...
set layer states: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:03<00:00, 9.11it/s]
sending embeddings
Overwriting default ffn_hidden_size value None with value from checkpoint 14336.
Overwriting default kv_channels value None with value from checkpoint 128.
Overwriting default group_query_attention value False with value from checkpoint True.
Overwriting default num_query_groups value 1 with value from checkpoint 8.
Overwriting default use_rotary_position_embeddings value False with value from checkpoint True.
Overwriting default add_position_embedding value True with value from checkpoint False.
Overwriting default normalization value LayerNorm with value from checkpoint RMSNorm.
Overwriting default swiglu value False with value from checkpoint True.
Overwriting default global_batch_size value None with value from checkpoint 1024.
Overwriting default dataloader_type value None with value from checkpoint single.
Overwriting default use_legacy_models value False with value from checkpoint True.
Overwriting default load value None with value from checkpoint /workspace/checkpoints/Mistral-7B-v0.3/.
Overwriting default overlap_p2p_comm value True with value from checkpoint False.
Overwriting default vocab_size value None with value from checkpoint 32768.
Overwriting default transformer_impl value transformer_engine with value from checkpoint local.
Checkpoint had argument iteration but new arguments does not have this.
Checkpoint had argument padded_vocab_size but new arguments does not have this.
Checkpoint had argument transformer_pipeline_model_parallel_size but new arguments does not have this.
Checkpoint had argument data_parallel_size but new arguments does not have this.
Checkpoint had argument consumed_train_samples but new arguments does not have this.
Checkpoint had argument consumed_valid_samples but new arguments does not have this.
Checkpoint had argument variable_seq_lengths but new arguments does not have this.
Checkpoint had argument disable_bias_linear but new arguments does not have this.
Checkpoint had argument model_type but new arguments does not have this.
Checkpoint had argument model_size but new arguments does not have this.
Setting consumed_train_samples to 0 and consumed_valid_samples to 0
sending transformer layer 0
sending transformer layer 1
sending transformer layer 2
sending transformer layer 3
sending transformer layer 4
received embeddings
Original vocab size not specified, leaving embedding table as-is. If you've changed the tensor parallel size this could cause problems.
building GPT model ...
sending transformer layer 5
sending transformer layer 6
sending transformer layer 7
sending transformer layer 8
sending transformer layer 9
sending transformer layer 10
/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py:607: UserWarning: To guarantee overlapping TP and SP collectives with the backwardGEMMs, set environment variable CUDA_DEVICE_MAX_CONNECTIONS = 1
warnings.warn(
sending transformer layer 11
> memory usage: 'saver', rank 0 / 4, mem 0.6/1121.8 gb.
building GPT model ...
sending transformer layer 12
> memory usage: 'saver', rank 1 / 4, mem 0.6/1121.8 gb.
building GPT model ...
sending transformer layer 13
sending transformer layer 14
sending transformer layer 15
> memory usage: 'saver', rank 2 / 4, mem 0.6/1121.8 gb.
building GPT model ...
sending transformer layer 16
> memory usage: 'saver', rank 3 / 4, mem 0.6/1121.8 gb.
sending transformer layer 17
sending transformer layer 18
sending transformer layer 19
received transformer layer 0
sending transformer layer 20
received transformer layer 1
sending transformer layer 21
sending transformer layer 22
sending transformer layer 23
sending transformer layer 24
received transformer layer 2
sending transformer layer 25
sending transformer layer 26
sending transformer layer 27
sending transformer layer 28
sending transformer layer 29
sending transformer layer 30
received transformer layer 3
sending transformer layer 31
sending final norm
sending output layer
received transformer layer 4
received transformer layer 5
received transformer layer 6
Waiting for saver to complete...
received transformer layer 7
received transformer layer 8
received transformer layer 9
received transformer layer 10
received transformer layer 11
received transformer layer 12
received transformer layer 13
received transformer layer 14
received transformer layer 15
received transformer layer 16
received transformer layer 17
received transformer layer 18
received transformer layer 19
received transformer layer 20
received transformer layer 21
received transformer layer 22
received transformer layer 23
received transformer layer 24
received transformer layer 25
received transformer layer 26
received transformer layer 27
received transformer layer 28
received transformer layer 29
received transformer layer 30
received transformer layer 31
received final norm
received output layer
saving checkpoint at iteration 1 to /workspace/checkpoints/Mistral-7B-v0.3-Megatron in torch format
successfully saved checkpoint from iteration 1 to /workspace/checkpoints/Mistral-7B-v0.3-Megatron
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/megatron/tools/checkpoint/saver_mcore.py", line 669, in save_checkpoint
save_checkpoint(md.iteration, [models[tp_rank]], None, None,
File "/workspace/megatron/megatron/training/checkpointing.py", line 410, in save_checkpoint
logger.debug(f"rank: {torch.distributed.get_rank()}, takes {end_misc - start_misc} to finalize ckpt save ")
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1633, in get_rank
default_pg = _get_default_group()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 985, in _get_default_group
raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
Environment (please complete the following information):
- Megatron-LM commit ID: df61e60
- PyTorch version: '2.3.0a0+ebedce2'
- CUDA version: '12.3'
- NCCL version: (2, 20, 3)
Proposed fix
Initialize torch.distributed at some point in the script.
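A minimal sketch of what such an initialization could look like for a single-process run. It assumes that a one-rank gloo group backed by a temporary file store is enough for calls such as torch.distributed.get_rank() to succeed; the helper name init_single_process_group is hypothetical and is not part of Megatron-LM. The import is guarded only so the sketch also loads on machines without PyTorch.

```python
import os
import tempfile

try:
    import torch.distributed as dist
except ImportError:  # PyTorch absent: the sketch becomes a no-op
    dist = None


def init_single_process_group() -> bool:
    """Create a 1-rank process group if none exists (hypothetical helper).

    Returns True when a default process group is available afterwards.
    """
    if dist is None or not dist.is_available():
        return False
    if dist.is_initialized():
        return True
    # A file store avoids needing MASTER_ADDR / MASTER_PORT to be set.
    store_path = os.path.join(tempfile.mkdtemp(), "pg_store")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{store_path}",
        rank=0,
        world_size=1,
    )
    return True


if init_single_process_group():
    print("rank:", dist.get_rank())
    dist.destroy_process_group()
```

The traceback above shows the failure inside the saver subprocess (Process-1), so a call like this would have to run in that process rather than only in the parent script.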
Additional context
I am running on a single node with 8 A100 GPUs, and I don't plan on using more than one node.
Using the current main branch.
Started docker with
docker run --gpus all -it --rm -v /my/path/to/Megatron-LM:/workspace/megatron -v /my/path/to/data:/workspace/dataset -v /my/path/to/checkpoints:/workspace/checkpoints -w /workspace/megatron/examples/multimodal --shm-size 6G megatron:multimodal
The image was built from the Dockerfile in Megatron-LM/examples/multimodal, since I am planning to use this to train multimodal LLaVA as described in examples/multimodal.
My WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT environment variables are unset. I am not sure whether any other environment variables could affect this.
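To confirm which of these variables a process actually sees, a small check like the following can be run inside the container. This is only an illustrative sketch: the helper name launch_env_report is hypothetical, and the variable list is limited to the four named above.

```python
import os
from typing import Mapping

# The four launcher variables mentioned in this report.
LAUNCH_VARS = ("WORLD_SIZE", "RANK", "MASTER_ADDR", "MASTER_PORT")


def launch_env_report(environ: Mapping[str, str] = os.environ) -> dict:
    """Map each launcher variable to its value, or None when it is unset."""
    return {name: environ.get(name) for name in LAUNCH_VARS}


for name, value in launch_env_report().items():
    print(f"{name}={'<unset>' if value is None else value}")
```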
From outside the docker:
$ docker network ls
NETWORK ID NAME DRIVER SCOPE
f1584b1ae327 bridge bridge local
1ce4f004699d host host local
06c0830e8909 my_network bridge local
ee252c0411d4 none null local
I have already updated to 0bc3547, but encountered the same problem when converting the llama3 70B model.
I meant I had to downgrade to that commit via git checkout 86850db930c85ed925e661574acc7564debf7988
Pseudo-solved by checking out the following commit: 86850db
Temporarily solved the problem...
@jon-barker Can you take a look at this, please?
This is a regression. An update to checkpointing code last week made the incorrect assumption that you'd always be in a distributed setting when saving checkpoints. We'll make a fix internally and push it out asap.
A short-term workaround (WAR) is to comment out the logger.debug lines in megatron/training/checkpointing.py.
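As an alternative to commenting the lines out, the rank lookup could be guarded so that it no longer assumes a distributed setting. The following is a sketch only and is not necessarily the fix that was pushed: rank_or_zero and log_finalize_time are hypothetical names, and falling back to rank 0 is an assumption. The import is guarded only so the sketch also loads without PyTorch.

```python
import logging

try:
    import torch.distributed as dist
except ImportError:  # PyTorch absent: always report rank 0
    dist = None

logger = logging.getLogger(__name__)


def rank_or_zero() -> int:
    """Return the distributed rank, or 0 when no process group exists."""
    if dist is not None and dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0  # assumption: treat a non-distributed run as rank 0


def log_finalize_time(start_misc: float, end_misc: float) -> str:
    """Log the finalize timing without requiring init_process_group."""
    message = f"rank: {rank_or_zero()}, takes {end_misc - start_misc} to finalize ckpt save "
    logger.debug(message)
    return message
```

With a guard like this, the debug message from the traceback can still be emitted when the converter runs without a process group.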
We have implemented a fix internally and it will be pushed out to GitHub later this week.
Now fixed in main