[CLI]: GPUs hang during distributed training when using `wandb.watch`
nmd2k opened this issue
Describe the bug
I found that during distributed training, calling `wandb.watch` with the argument `log="all"` leads to the GPUs hanging (the rank 0 GPU is loaded but idle, and the rank 1 process makes no further progress).
The code I added:
if accelerator.is_main_process and args.report_to == "wandb":
wandb.watch(model, log="all", log_freq=args.logging_steps)
The problem is gone when the `log="all"` argument is removed. It seems like something is wrong with logging the model parameters.
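Since the hang only appears with `log="all"`, one hedged mitigation sketch is to log gradients only (or disable logging) and to watch the unwrapped module. `log="gradients"` and `log=None` are documented values for `wandb.watch`, but whether either avoids the hang here is untested:

```python
# Hedged mitigation sketch, not a confirmed fix: log="gradients" (the
# default) and log=None are both valid values for wandb.watch, and the
# hang above was only observed with log="all".
if accelerator.is_main_process and args.report_to == "wandb":
    wandb.watch(
        # Watching the underlying module instead of the DDP wrapper is an
        # untested guess, not something the report confirms helps.
        accelerator.unwrap_model(model),
        log="gradients",
        log_freq=args.logging_steps,
    )
```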
Additional Files
No response
Environment
wandb = 0.16.5
transformers = 4.39.0
pytorch = 2.2.1
accelerate = 0.29.3
Additional Context
No response
Hi @nmd2k, thank you for reporting this and for letting us know you are experiencing the issue with `log="all"` only.
Would you mind sharing some additional information to help us reproduce and troubleshoot the issue:
- The `debug.log` and `debug-internal.log` files, which you can find in the `./wandb/run-<date_time>-<run_id>/logs` folder (see the sketch below for one way to locate them)
- What is your experiment environment setup? Are you running the training locally or on a remote resource? If so, which kind, and how is the training initiated? Are you running via a Jupyter Notebook or through a script?
- A code snippet for your training experiment would also be useful
Hi @nmd2k, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.
Hi, sorry for the late reply.
I used the official Hugging Face example to fine-tune LLMs (on a simple next-token-prediction task); the only modification is the `watch` call added at line L563:
if accelerator.is_main_process and args.report_to == "wandb":
wandb.watch(model, log="all", log_freq=args.logging_steps)
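For context, a sketch of where this lands in the no-trainer example flow; the assumption (not confirmed by the excerpt above) is that the call sits after `accelerator.prepare()`, so the hooks attach to the distributed wrapper on rank 0 only:

```python
# Sketch of the assumed surrounding flow in the fine-tuning script; names
# (model, optimizer, args, accelerator) come from the Hugging Face example.
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

# At this point `model` is the DDP-wrapped module, and the watch hooks are
# registered on rank 0 only.
if accelerator.is_main_process and args.report_to == "wandb":
    wandb.watch(model, log="all", log_freq=args.logging_steps)
```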
Here is the `debug.log`:
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Configure stats pid to 2311240
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from /home/dungnm31/.config/wandb/settings
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from /home/dungnm31/foundation-models/wandb/settings
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'project': 'foundation-models-exp2', 'api_key': '***REDACTED***'}
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'train/ft_no_trainer.py', 'program_abspath': '/home/dungnm31/foundation-models/train/ft_no_trainer.py', 'program': '/home/dungnm31/foundation-models/train/ft_no_trainer.py'}
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:_log_setup():526] Logging user logs to /home/dungnm31/foundation-models/wandb/run-20240425_134526-u8vu5idg/logs/debug.log
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:_log_setup():527] Logging internal logs to /home/dungnm31/foundation-models/wandb/run-20240425_134526-u8vu5idg/logs/debug-internal.log
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:init():566] calling init triggers
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
config: {}
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:init():616] starting backend
2024-04-25 13:45:26,331 INFO MainThread:2311240 [wandb_init.py:init():620] setting up manager
2024-04-25 13:45:26,332 INFO MainThread:2311240 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-04-25 13:45:26,334 INFO MainThread:2311240 [wandb_init.py:init():628] backend started and connected
2024-04-25 13:45:26,336 INFO MainThread:2311240 [wandb_init.py:init():720] updated telemetry
2024-04-25 13:45:26,443 INFO MainThread:2311240 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
2024-04-25 13:45:27,111 INFO MainThread:2311240 [wandb_run.py:_on_init():2262] communicating current version
2024-04-25 13:45:27,410 INFO MainThread:2311240 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.16.6 is available! To upgrade, please run:\n $ pip install wandb --upgrade"
2024-04-25 13:45:27,410 INFO MainThread:2311240 [wandb_init.py:init():804] starting run threads in backend
2024-04-25 13:45:30,400 INFO MainThread:2311240 [wandb_run.py:_console_start():2241] atexit reg
2024-04-25 13:45:30,400 INFO MainThread:2311240 [wandb_run.py:_redirect():2096] redirect: wrap_raw
2024-04-25 13:45:30,400 INFO MainThread:2311240 [wandb_run.py:_redirect():2161] Wrapping output streams.
2024-04-25 13:45:30,400 INFO MainThread:2311240 [wandb_run.py:_redirect():2186] Redirects installed.
2024-04-25 13:45:30,401 INFO MainThread:2311240 [wandb_init.py:init():847] run started, returning control to user process
2024-04-25 13:45:30,402 INFO MainThread:2311240 [wandb_run.py:_config_callback():1343] config_cb None None {'dataset_name_or_path': '/cm/archive/dungnm31/data/foundation-model/data_test.jsonl', 'dataset_config_name': None, 'model_name_or_path': 'mistralai/Mistral-7B-v0.1', 'lora': False, 'config_name': None, 'tokenizer_name': None, 'use_slow_tokenizer': False, 'per_device_train_batch_size': 2, 'per_device_eval_batch_size': 2, 'learning_rate': 2.5e-05, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam_epsilon': 1e-07, 'weight_decay': 3e-05, 'num_train_epochs': 3, 'gradient_accumulation_steps': 1, 'gradient_checkpointing': False, 'lr_scheduler_type': 'linear', 'warmup_steps': 50, 'output_dir': '/cm/archive/dungnm31/foundation-exps/mistral7B-LR-finalv1-1', 'seed': 42, 'preprocessing_num_workers': 50, 'max_length': 1024, 'prompt_template': 'llama', 'trust_remote_code': True, 'logging_steps': 1, 'eval_steps': 50, 'save_steps': 50, 'save_total_limit': 10, 'resume_from_checkpoint': None, 'report_to': 'wandb', 'low_cpu_mem_usage': False, 'metric_for_best_model': 'loss'}
2024-04-25 13:45:30,447 INFO MainThread:2311240 [wandb_watch.py:watch():51] Watching
Here is the `debug-internal.log`:
debug-internal.log
Same issue. The whole training process stalls for several minutes every fixed number of steps (the rank 0 GPU sits at 0% utilization). Everything returns to normal once I delete the `wandb_watch["all"]` setting.
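Since the stalls recur every `log_freq` steps, two untested mitigations are raising `log_freq` or gating the watch behind a flag so it can be switched off without code edits; `WANDB_WATCH_MODE` below is a hypothetical variable name, not one wandb defines:

```python
import os

# Hypothetical opt-in switch; unset (the default) skips the watch entirely.
watch_mode = os.environ.get("WANDB_WATCH_MODE")  # e.g. "gradients" or "all"

if accelerator.is_main_process and watch_mode:
    # Raising log_freq spaces out the expensive logging steps; whether that
    # shortens or merely defers the reported stalls is untested.
    wandb.watch(model, log=watch_mode, log_freq=max(args.logging_steps, 500))
```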
Hi there, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!