microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Error when setting num-experts>1 while running generate_text.sh

jrt-20 opened this issue

I am running examples_deepspeed/generate_text.sh.
Currently, I can run this script successfully on 1 node with 8 GPUs when num-experts=1.
However, when I set num-experts=8, the run fails.
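
For clarity, by "setting experts" I mean passing Megatron's --num-experts flag through the script. A minimal illustration of how I understand that argument to be parsed (the argparse definition below is my assumption from reading megatron/arguments.py, not copied verbatim); it also shows that the value arrives on args as a list:

import argparse

# Assumed shape of Megatron's --num-experts argument: a list of ints with default [1],
# so a single value like "--num-experts 8" ends up as [8] on args.
parser = argparse.ArgumentParser()
parser.add_argument('--num-experts', type=int, nargs='+', default=[1])

print(parser.parse_args([]).num_experts)                       # [1] -> the working case
print(parser.parse_args(['--num-experts', '8']).num_experts)   # [8] -> the failing case
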
The complete error output is as follows:

using world size: 8, data-parallel-size: 8, sequence-parallel size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
[2024-01-22 08:37:16,393] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,399] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,399] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-01-22 08:37:16,648] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,668] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,669] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,687] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,852] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,856] [INFO] [comm.py:637:init_distributed] cdb=None
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> compiling dataset index builder ...
make: Entering directory '/home/ai/jrtPain/Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/ai/jrtPain/Megatron-DeepSpeed/megatron/data'
>>> done with dataset index builder. Compilation time: 0.061 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ai/jrtPain/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ai/jrtPain/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ai/jrtPain/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_softmax_cuda...
>>> done with compiling and loading fused kernels. Compilation time: 2.871 seconds
building GPT model ...
[2024-01-22 08:37:20,393] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,430] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,463] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,488] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,518] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,547] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,577] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,620] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,654] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,686] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,719] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,749] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1

Emitting ninja build file /home/ai/.cache/torch_extensions/py310_cu121/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.11441206932067871 seconds
Traceback (most recent call last):
  File "/home/ai/jrtPain/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 178, in <module>
    main()
  File "/home/ai/jrtPain/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 141, in main
    model = ds_inference(model, args)
  File "/home/ai/jrtPain/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 164, in ds_inference
    engine = deepspeed.init_inference(model=model,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 158, in __init__
    self._apply_injection_policy(config)
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 418, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 342, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 586, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 646, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 646, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 646, in _replace_module
    _, layer_id = _replace_module(child,
  [Previous line repeated 1 more time]
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 622, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 298, in replace_fn
    new_module = replace_with_policy(child,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 250, in replace_with_policy
    _container.transpose()
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/containers/features/megatron.py", line 28, in transpose
    super().transpose()
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/containers/base.py", line 286, in transpose
    self.transpose_mlp()
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/containers/base.py", line 295, in transpose_mlp
    self._h4h_w = self.transpose_impl(self._h4h_w.data)
AttributeError: 'list' object has no attribute 'data'
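
My reading of the traceback (an assumption on my side, not a confirmed diagnosis): with num-experts > 1, the Megatron MoE MLP keeps one h-to-4h weight tensor per local expert, so the DeepSpeed container's _h4h_w ends up holding a Python list of tensors rather than a single parameter, and transpose_impl(self._h4h_w.data) then fails exactly as shown. A minimal standalone sketch of that mismatch (all names here are illustrative, not the real container code):

import torch

def transpose_impl(w):
    # stand-in for what the container does with a single dense MLP weight tensor
    return w.t().contiguous()

dense_h4h_w = torch.nn.Parameter(torch.randn(4, 8))                     # dense model: one tensor
moe_h4h_w = [torch.nn.Parameter(torch.randn(4, 8)) for _ in range(8)]   # MoE model: one tensor per expert

print(transpose_impl(dense_h4h_w.data).shape)   # works: torch.Size([8, 4])

try:
    transpose_impl(moe_h4h_w.data)              # a plain Python list has no .data attribute
except AttributeError as err:
    print(err)                                  # 'list' object has no attribute 'data'
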

The relevant package versions in my environment are:

deepspeed 0.12.6
torch 2.1.1
transformers 4.25.0
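
A workaround I would try first (an untested sketch, not a fix confirmed by the maintainers): since the crash happens inside deepspeed.init_inference's module replacement, skip the DeepSpeed inference-engine injection for the MoE run and generate with the plain Megatron model. A guard around the call in tools/generate_samples_gpt.py could look roughly like the following; the attribute names args.ds_inference and args.num_experts are assumptions about how the script and Megatron arguments are wired:

# Sketch only: bypass DeepSpeed kernel injection when more than one expert is configured,
# because the replace_with_policy() path appears to assume a single dense MLP weight per layer.
use_ds_inference = getattr(args, "ds_inference", False)   # assumed flag name
num_experts = getattr(args, "num_experts", [1])           # assumed to be a list, as Megatron parses it

if use_ds_inference and all(n == 1 for n in num_experts):
    model = ds_inference(model, args)   # dense model: kernel injection works as before
# else: fall back to the un-injected model for text generation

Note that this only avoids the crash; it does not give the MoE run the fused DeepSpeed inference kernels.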