[CLI]: GPUs hang during distributed training when using `wandb.watch`
nmd2k opened this issue
Describe the bug
I found that during distributed training, calling `wandb.watch` with the argument `log="all"` leads to the GPUs hanging (the rank 0 GPU is loaded but idle, and the rank 1 process makes no further progress).
The code I added:
if accelerator.is_main_process and args.report_to == "wandb":
wandb.watch(model, log="all", log_freq=args.logging_steps)
The problem is gone when the `log="all"` argument is removed. It seems like something is wrong with logging the model parameters.
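Since the hang only appears with `log="all"`, one hedged mitigation sketch is to log gradients only (or disable logging) and to watch the unwrapped module. `log="gradients"` and `log=None` are documented values for `wandb.watch`, but whether either avoids the hang here is untested:

```python
# Hedged mitigation sketch, not a confirmed fix: log="gradients" (the
# default) and log=None are both valid values for wandb.watch, and the
# hang above was only observed with log="all".
if accelerator.is_main_process and args.report_to == "wandb":
    wandb.watch(
        # Watching the underlying module instead of the DDP wrapper is an
        # untested guess, not something the report confirms helps.
        accelerator.unwrap_model(model),
        log="gradients",
        log_freq=args.logging_steps,
    )
```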
Additional Files
No response
Environment
wandb = 0.16.5
transformers = 4.39.0
pytorch = 2.2.1
accelerate = 0.29.3
Additional Context
No response
Hi @nmd2k, thank you for reporting this and for letting us know you are experiencing the issue with `log="all"` only.
Would you mind sharing some additional information to help us reproduce and troubleshoot the issue:
- The `debug.log` and `debug-internal.log` files, which you can find in the `./wandb/run-<date_time>-<run_id>/logs` folder (see the sketch below for one way to locate them)
- What is your experiment environment setup? Are you running the training locally or on a remote resource? If so, which kind, and how is the training initiated? Are you running via a Jupyter Notebook or through a script?
- A code snippet for your training experiment would also be useful
Hi @nmd2k, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.
Hi, sorry for the late reply.
I used the official Hugging Face example to fine-tune LLMs (on a simple next-token-prediction task); the only modification is the `watch` call added at line L563:
if accelerator.is_main_process and args.report_to == "wandb":
wandb.watch(model, log="all", log_freq=args.logging_steps)
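For context, a sketch of where this lands in the no-trainer example flow; the assumption (not confirmed by the excerpt above) is that the call sits after `accelerator.prepare()`, so the hooks attach to the distributed wrapper on rank 0 only:

```python
# Sketch of the assumed surrounding flow in the fine-tuning script; names
# (model, optimizer, args, accelerator) come from the Hugging Face example.
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

# At this point `model` is the DDP-wrapped module, and the watch hooks are
# registered on rank 0 only.
if accelerator.is_main_process and args.report_to == "wandb":
    wandb.watch(model, log="all", log_freq=args.logging_steps)
```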
Here is the `debug.log`:
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Configure stats pid to 2311240
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from /home/dungnm31/.config/wandb/settings
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from /home/dungnm31/foundation-models/wandb/settings
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'project': 'foundation-models-exp2', 'api_key': '***REDACTED***'}
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'train/ft_no_trainer.py', 'program_abspath': '/home/dungnm31/foundation-models/train/ft_no_trainer.py', 'program': '/home/dungnm31/foundation-models/train/ft_no_trainer.py'}
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:_log_setup():526] Logging user logs to /home/dungnm31/foundation-models/wandb/run-20240425_134526-u8vu5idg/logs/debug.log
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:_log_setup():527] Logging internal logs to /home/dungnm31/foundation-models/wandb/run-20240425_134526-u8vu5idg/logs/debug-internal.log
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:init():566] calling init triggers
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
config: {}
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:init():616] starting backend
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:init():620] setting up manager
2024-04-25 13:45:26,332 INFO MainThread:2311240 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-04-25 13:45:26,334 INFO MainThread:2311240 [wandb_init.py:init():628] backend started and connected
2024-04-25 13:45:26,336 INFO MainThread:2311240 [wandb_init.py:init():720] updated telemetry
2024-04-25 13:45:26,443 INFO MainThread:2311240 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
2024-04-25 13:45:27,111 INFO MainThread:2311240 [wandb_run.py:_on_init():2262] communicating current version
2024-04-25 13:45:27,410 INFO MainThread:2311240 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.16.6 is available! To upgrade, please run:\n $ pip install wandb --upgrade"
2024-04-25 13:45:27,410 INFO MainThread:2311240 [wandb_init.py:init():804] starting run threads in backend
2024-04-25 13:45:30,400 INFO MainThread:2311240 [wandb_run.py:_console_start():2241] atexit reg
2024-04-25 13:45:30,400 INFO MainThread:2311240 [wandb_run.py:_redirect():2096] redirect: wrap_raw
2024-04-25 13:45:30,400 INFO MainThread:2311240 [wandb_run.py:_redirect():2161] Wrapping output streams.
2024-04-25 13:45:30,400 INFO MainThread:2311240 [wandb_run.py:_redirect():2186] Redirects installed.
2024-04-25 13:45:30,401 INFO MainThread:2311240 [wandb_init.py:init():847] run started, returning control to user process
2024-04-25 13:45:30,402 INFO MainThread:2311240 [wandb_run.py:_config_callback():1343] config_cb None None {'dataset_name_or_path': '/cm/archive/dungnm31/data/foundation-model/data_test.jsonl', 'dataset_config_name': None, 'model_name_or_path': 'mistralai/Mistral-7B-v0.1', 'lora': False, 'config_name': None, 'tokenizer_name': None, 'use_slow_tokenizer': False, 'per_device_train_batch_size': 2, 'per_device_eval_batch_size': 2, 'learning_rate': 2.5e-05, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam_epsilon': 1e-07, 'weight_decay': 3e-05, 'num_train_epochs': 3, 'gradient_accumulation_steps': 1, 'gradient_checkpointing': False, 'lr_scheduler_type': 'linear', 'warmup_steps': 50, 'output_dir': '/cm/archive/dungnm31/foundation-exps/mistral7B-LR-finalv1-1', 'seed': 42, 'preprocessing_num_workers': 50, 'max_length': 1024, 'prompt_template': 'llama', 'trust_remote_code': True, 'logging_steps': 1, 'eval_steps': 50, 'save_steps': 50, 'save_total_limit': 10, 'resume_from_checkpoint': None, 'report_to': 'wandb', 'low_cpu_mem_usage': False, 'metric_for_best_model': 'loss'}
2024-04-25 13:45:30,447 INFO MainThread:2311240 [wandb_watch.py:watch():51] Watching
Here is the `debug-internal.log`:
debug-internal.log
Same issue. The whole training process stalls for several minutes every fixed number of steps (the rank 0 GPU sits at 0% utilization). Everything returns to normal once I delete the `wandb_watch["all"]` setting.
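Since the stalls recur every `log_freq` steps, two untested mitigations are raising `log_freq` or gating the watch behind a flag so it can be switched off without code edits; `WANDB_WATCH_MODE` below is a hypothetical variable name, not one wandb defines:

```python
import os

# Hypothetical opt-in switch; unset (the default) skips the watch entirely.
watch_mode = os.environ.get("WANDB_WATCH_MODE")  # e.g. "gradients" or "all"

if accelerator.is_main_process and watch_mode:
    # Raising log_freq spaces out the expensive logging steps; whether that
    # shortens or merely defers the reported stalls is untested.
    wandb.watch(model, log=watch_mode, log_freq=max(args.logging_steps, 500))
```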
Hi there, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!