wandb / wandb

🔥 A tool for visualizing and tracking your machine learning experiments. This repo contains the CLI and Python API.

Home Page: https://wandb.ai

[CLI]: GPUs hanging during distributed training caused by `wandb.watch`

nmd2k opened this issue

Describe the bug

I found that during distributed training, wandb.watch with the argument log="all" leads to the GPUs hanging (the rank 0 GPUs are loaded but not doing any work, and the rank 1 process makes no further progress).

(with log="all")
[screenshot]

The integration code:

if accelerator.is_main_process and args.report_to == "wandb":
      wandb.watch(model, log="all", log_freq=args.logging_steps)

The problem goes away when I remove the argument log="all". It seems something is wrong with logging the model parameters.

(without log="all")
[screenshot]
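For reference, this is the working variant (a minimal sketch of the same guard; without log="all", wandb.watch falls back to its default gradient-only logging):

if accelerator.is_main_process and args.report_to == "wandb":
    # Default log mode ("gradients"); only log="all" (gradients + parameters)
    # triggers the hang described above.
    wandb.watch(model, log_freq=args.logging_steps)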

Additional Files

No response

Environment

wandb = 0.16.5
transformers = 4.39.0
pytorch = 2.2.1
accelerate = 0.29.3

Additional Context

No response

Hi @nmd2k, thank you for reporting this and letting us know you have been experiencing the issue with log="all" only.

Would you mind sharing some additional information to help us reproduce and troubleshoot the issue:

  • The debug.log and debug-internal.log files, which you can find in the `./wandb/run-<date_time>-<run_id>/logs` folder
  • What is your experiment environment setup? Are you running the training locally or on a remote resource? If remote, which kind, and how is the training initiated? Are you running via a Jupyter Notebook or through a script?
  • A code snippet for your training experiment would also be useful

Hi @nmd2k, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi, sorry for the late reply.
I used the official Hugging Face example to fine-tune LLMs (on a simple next-token prediction task); the only modification is adding the watch call at line 563 of that script:

if accelerator.is_main_process and args.report_to == "wandb":
      wandb.watch(model, log="all", log_freq=args.logging_steps)
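For context, the overall setup is roughly equivalent to this stripped-down sketch (a hypothetical stand-in model, optimizer, and data rather than the actual script; launched with accelerate launch across the GPUs):

import torch
import wandb
from accelerate import Accelerator

# Hypothetical minimal stand-in for the fine-tuning loop; only the
# wandb.watch call mirrors the real modification.
accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers(project_name="watch-hang-repro")

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-5)
model, optimizer = accelerator.prepare(model, optimizer)

# The single modification: watch gradients + parameters from rank 0 only.
if accelerator.is_main_process:
    wandb.watch(model, log="all", log_freq=1)

for step in range(100):
    batch = torch.randn(2, 16, device=accelerator.device)
    loss = model(batch).pow(2).mean()
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    if accelerator.is_main_process:
        wandb.log({"loss": loss.item()}, step=step)

accelerator.end_training()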

Here is the debug.log:

2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Configure stats pid to 2311240
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from /home/dungnm31/.config/wandb/settings
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from /home/dungnm31/foundation-models/wandb/settings
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'project': 'foundation-models-exp2', 'api_key': '***REDACTED***'}
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'train/ft_no_trainer.py', 'program_abspath': '/home/dungnm31/foundation-models/train/ft_no_trainer.py', 'program': '/home/dungnm31/foundation-models/train/ft_no_trainer.py'}
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:_log_setup():526] Logging user logs to /home/dungnm31/foundation-models/wandb/run-20240425_134526-u8vu5idg/logs/debug.log
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:_log_setup():527] Logging internal logs to /home/dungnm31/foundation-models/wandb/run-20240425_134526-u8vu5idg/logs/debug-internal.log
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:init():566] calling init triggers
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
config: {}
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:init():616] starting backend
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:init():620] setting up manager
2024-04-25 13:45:26,332 INFO    MainThread:2311240 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-04-25 13:45:26,334 INFO    MainThread:2311240 [wandb_init.py:init():628] backend started and connected
2024-04-25 13:45:26,336 INFO    MainThread:2311240 [wandb_init.py:init():720] updated telemetry
2024-04-25 13:45:26,443 INFO    MainThread:2311240 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
2024-04-25 13:45:27,111 INFO    MainThread:2311240 [wandb_run.py:_on_init():2262] communicating current version
2024-04-25 13:45:27,410 INFO    MainThread:2311240 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.16.6 is available!  To upgrade, please run:\n $ pip install wandb --upgrade"

2024-04-25 13:45:27,410 INFO    MainThread:2311240 [wandb_init.py:init():804] starting run threads in backend
2024-04-25 13:45:30,400 INFO    MainThread:2311240 [wandb_run.py:_console_start():2241] atexit reg
2024-04-25 13:45:30,400 INFO    MainThread:2311240 [wandb_run.py:_redirect():2096] redirect: wrap_raw
2024-04-25 13:45:30,400 INFO    MainThread:2311240 [wandb_run.py:_redirect():2161] Wrapping output streams.
2024-04-25 13:45:30,400 INFO    MainThread:2311240 [wandb_run.py:_redirect():2186] Redirects installed.
2024-04-25 13:45:30,401 INFO    MainThread:2311240 [wandb_init.py:init():847] run started, returning control to user process
2024-04-25 13:45:30,402 INFO    MainThread:2311240 [wandb_run.py:_config_callback():1343] config_cb None None {'dataset_name_or_path': '/cm/archive/dungnm31/data/foundation-model/data_test.jsonl', 'dataset_config_name': None, 'model_name_or_path': 'mistralai/Mistral-7B-v0.1', 'lora': False, 'config_name': None, 'tokenizer_name': None, 'use_slow_tokenizer': False, 'per_device_train_batch_size': 2, 'per_device_eval_batch_size': 2, 'learning_rate': 2.5e-05, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam_epsilon': 1e-07, 'weight_decay': 3e-05, 'num_train_epochs': 3, 'gradient_accumulation_steps': 1, 'gradient_checkpointing': False, 'lr_scheduler_type': 'linear', 'warmup_steps': 50, 'output_dir': '/cm/archive/dungnm31/foundation-exps/mistral7B-LR-finalv1-1', 'seed': 42, 'preprocessing_num_workers': 50, 'max_length': 1024, 'prompt_template': 'llama', 'trust_remote_code': True, 'logging_steps': 1, 'eval_steps': 50, 'save_steps': 50, 'save_total_limit': 10, 'resume_from_checkpoint': None, 'report_to': 'wandb', 'low_cpu_mem_usage': False, 'metric_for_best_model': 'loss'}
2024-04-25 13:45:30,447 INFO    MainThread:2311240 [wandb_watch.py:watch():51] Watching

Here is the debug-internal.log:
debug-internal.log

@fzp0424 commented

Same issue. The whole training process stalls for several minutes at fixed step intervals (the rank 0 GPU sits at 0% utilization). Everything returns to normal once I delete the wandb_watch["all"] setting.

Hey @nmd2k @fzp0424, thanks for sharing these details! Would you mind setting os.environ["WANDB_WATCH"] = "all" instead of passing log="all" as an argument, and letting us know if you still face the same issue?
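For example (a minimal sketch; the variable needs to be set before the run is initialized, e.g. at the top of the training script):

import os

# Suggested alternative: request full watching via the environment variable
# instead of passing log="all" to wandb.watch directly.
os.environ["WANDB_WATCH"] = "all"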

Hi there, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!