Exception in RecordFunction callback: state_ptr INTERNAL ASSERT FAILED at "../torch/csrc/profiler/standalone/nvtx_observer.cpp":115

Question

nhkhoi91 opened this issue 3 months ago

khoi.nguyen commented 3 months ago

Bug description

🐛 Bug

I am trying to use PyTorchProfiler and write its output to a TensorBoard folder on S3, but I get the exception shown in the title.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

The code is submitted to AWS SageMaker via a remote function as a training job. I am not sure whether that is part of the problem. The code is below.


tensorboard_logs_path = 's3://donut_extraction'
logger = TensorBoardLogger(tensorboard_logs_path, name="donut", version='v1')
processor = DonutProcessor.from_pretrained(MODEL_NAME)
wrap_policy = {DonutSwinEncoder, MBartForCausalLM, DonutSwinModel}
strategy = FSDPStrategy(
    auto_wrap_policy=wrap_policy,
    state_dict_type="sharded",
    limit_all_gathers=True,
)
device_stats = DeviceStatsMonitor(cpu_stats=True)
model_module = ImageModelModule(
    train_config,
    processor,
    train_dataloader,
    val_dataloader,
    version=1,
)
profiler = PyTorchProfiler(
    on_trace_ready=torch.profiler.tensorboard_trace_handler(
        f'{tensorboard_logs_path}/profiler0'
    ),
    filename='perf-logs',
    emit_nvtx=True,
)
trainer = pl.Trainer(
    devices=4,
    accelerator='cuda',
    accumulate_grad_batches=ACUMULATE_GRAD_BATCHES,
    # max_epochs=train_config.max_epochs,
    max_epochs=4,
    val_check_interval=train_config.val_check_interval,
    check_val_every_n_epoch=2,
    precision="16-mixed",
    num_sanity_val_steps=0,
    callbacks=[device_stats],
    # default_root_dir=ckpt_path,
    strategy=strategy,
    logger=logger,
    profiler=profiler,
)
trainer.fit(model_module)

Error messages and logs

[rank1]:[2024-05-05 09:59:51,519] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:[2024-05-05 09:59:51,523] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank1]:[2024-05-05 09:59:51,523] [0/0_1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:[2024-05-05 09:59:51,527] [0/0_1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank2]:[2024-05-05 09:59:51,535] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank3]:[2024-05-05 09:59:51,538] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank2]:[2024-05-05 09:59:51,539] [0/0_1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank3]:[2024-05-05 09:59:51,542] [0/0_1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored

Traceback (most recent call last):
  File "/var/folders/h8/1_7bqspx4mj27hqz4qr1gp_m0000gn/T/ipykernel_37353/4058334054.py", line 165, in train_donut
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1032, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 138, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 242, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 184, in run
    closure()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 319, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 390, in training_step
    return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 642, in __call__
    wrapper_output = wrapper_module(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 849, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 635, in wrapped_forward
    out = method(*_args, **_kwargs)
  File "/var/folders/h8/1_7bqspx4mj27hqz4qr1gp_m0000gn/T/ipykernel_37353/1410972043.py", line 86, in training_step
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1550, in _call_impl
    args_result = hook(self, args)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/profilers/pytorch.py", line 72, in _start_recording_forward
    record.__enter__()
TypeError: nullcontext.__enter__() missing 1 required positional argument: 'self'
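
The `TypeError` that ends this traceback has the form Python produces when `__enter__` is called on the `contextlib.nullcontext` class itself rather than on an instance. The sketch below reproduces only that message using the standard library. It assumes, without confirmation from these logs, that `record` in `_start_recording_forward` was bound to the class; the name `record` here is a hypothetical stand-in, and this is an illustration rather than a diagnosis of Lightning or PyTorch internals.

```python
import contextlib

# Hypothetical stand-in for the object the profiler hook received.
# Assumption: it is the nullcontext *class*, not an instance of it.
record = contextlib.nullcontext

try:
    record.__enter__()  # called on the class, so nothing is bound to `self`
except TypeError as exc:
    class_call_error = str(exc)

# Calling __enter__ on an instance succeeds and returns enter_result,
# which defaults to None.
instance_result = contextlib.nullcontext().__enter__()

print(class_call_error)
```

If the assumption holds, the failure is a Python-level one in the profiler hook, separate from the C++ `state_ptr` assertion reported in the warning that follows.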

[rank2]:[W record_function.cpp:499] Exception in RecordFunction callback: state_ptr INTERNAL ASSERT FAILED at "../torch/csrc/profiler/standalone/nvtx_observer.cpp":115, please report a bug to PyTorch. Expected profiler state set
Exception raised from updateOutputTensorTracker at ../torch/csrc/profiler/standalone/nvtx_observer.cpp:115 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd0d5e76d87 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd0d5e2775f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x43 (0x7fd0d5e74873 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x56c3f26 (0x7fd0be294f26 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::RecordFunction::end() + 0x51 (0x7fd0ba5bf411 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::RecordFunction::~RecordFunction() + 0x22 (0x7fd0ba5bf462 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x4ee58a8 (0x7fd0bdab68a8 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x7a067c (0x7fd0d672267c in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa480b5 (0x7fd0d69ca0b5 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x4117ab (0x7fd0d63937ab in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x412731 (0x7fd0d6394731 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #22: __libc_start_main + 0xea (0x7fd18ce5ed0a in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: _start + 0x2a (0x55e3c20bf07a in /usr/local/bin/python)
 , for the range [pl][module]torch._dynamo.eval_frame.OptimizedModule: model

Environment

Current environment

* CUDA:
	- GPU:
		- Tesla V100-SXM2-16GB
		- Tesla V100-SXM2-16GB
		- Tesla V100-SXM2-16GB
		- Tesla V100-SXM2-16GB
	- available:         True
	- version:           12.1
* Lightning:
	- lightning:         2.2.0.post0
	- lightning-utilities: 0.10.1
	- pytorch-lightning: 2.2.0.post0
	- torch:             2.2.0
	- torchmetrics:      1.3.1
	- torchvision:       0.17.0
* Packages:
	- absl-py:           2.1.0
	- accelerate:        0.27.2
	- aiobotocore:       2.11.2
	- aiohttp:           3.9.3
	- aioitertools:      0.11.0
	- aiosignal:         1.3.1
	- asttokens:         2.4.1
	- async-timeout:     4.0.3
	- attrs:             23.2.0
	- authlib:           1.3.0
	- awscli:            1.32.32
	- boto3:             1.34.34
	- botocore:          1.34.34
	- certifi:           2024.2.2
	- cffi:              1.16.0
	- charset-normalizer: 3.3.2
	- click:             8.1.7
	- cloudpickle:       2.2.1
	- colorama:          0.4.4
	- comm:              0.2.1
	- contextlib2:       21.6.0
	- cryptography:      42.0.2
	- debugpy:           1.8.0
	- decorator:         5.1.1
	- dill:              0.3.8
	- docker:            7.0.0
	- docutils:          0.16
	- donut:             0.2.2
	- dparse:            0.6.4b0
	- exceptiongroup:    1.2.0
	- executing:         2.0.1
	- filelock:          3.13.1
	- frozenlist:        1.4.1
	- fsspec:            2024.2.0
	- google-pasta:      0.2.0
	- grpcio:            1.60.1
	- huggingface-hub:   0.20.3
	- idna:              3.6
	- importlib-metadata: 6.11.0
	- ipykernel:         6.29.0
	- ipython:           8.21.0
	- jedi:              0.19.1
	- jinja2:            3.1.3
	- jmespath:          1.0.1
	- joblib:            1.3.2
	- jsonschema:        4.21.1
	- jsonschema-specifications: 2023.12.1
	- jupyter-client:    8.6.0
	- jupyter-core:      5.7.1
	- lightning:         2.2.0.post0
	- lightning-utilities: 0.10.1
	- markdown:          3.5.2
	- markdown-it-py:    3.0.0
	- markupsafe:        2.1.5
	- marshmallow:       3.20.2
	- matplotlib-inline: 0.1.6
	- mdurl:             0.1.2
	- mpmath:            1.3.0
	- multidict:         6.0.5
	- multiprocess:      0.70.16
	- nest-asyncio:      1.6.0
	- networkx:          3.2.1
	- nltk:              3.8.1
	- numpy:             1.26.4
	- nvidia-cublas-cu12: 12.1.3.1
	- nvidia-cuda-cupti-cu12: 12.1.105
	- nvidia-cuda-nvrtc-cu12: 12.1.105
	- nvidia-cuda-runtime-cu12: 12.1.105
	- nvidia-cudnn-cu12: 8.9.2.26
	- nvidia-cufft-cu12: 11.0.2.54
	- nvidia-curand-cu12: 10.3.2.106
	- nvidia-cusolver-cu12: 11.4.5.107
	- nvidia-cusparse-cu12: 12.1.0.106
	- nvidia-nccl-cu12:  2.19.3
	- nvidia-nvjitlink-cu12: 12.3.101
	- nvidia-nvtx-cu12:  12.1.105
	- packaging:         23.2
	- pandas:            1.5.3
	- parso:             0.8.3
	- pathos:            0.3.2
	- pexpect:           4.9.0
	- pillow:            10.2.0
	- pip:               23.3.2
	- platformdirs:      4.2.0
	- pox:               0.3.4
	- ppft:              1.7.6.8
	- prompt-toolkit:    3.0.43
	- protobuf:          4.25.3
	- psutil:            5.9.8
	- ptyprocess:        0.7.0
	- pure-eval:         0.2.2
	- pyasn1:            0.5.1
	- pycparser:         2.21
	- pydantic:          1.10.14
	- pygments:          2.17.2
	- python-dateutil:   2.8.2
	- pytorch-lightning: 2.2.0.post0
	- pytz:              2024.1
	- pyyaml:            6.0.1
	- pyzmq:             25.1.2
	- rapidfuzz:         3.6.2
	- referencing:       0.33.0
	- regex:             2023.12.25
	- requests:          2.31.0
	- rich:              13.7.0
	- rpds-py:           0.18.0
	- rsa:               4.7.2
	- ruamel.yaml:       0.18.5
	- ruamel.yaml.clib:  0.2.8
	- s3fs:              2024.2.0
	- s3transfer:        0.10.0
	- safetensors:       0.4.2
	- safety-schemas:    0.0.1
	- sagemaker:         2.208.0
	- schema:            0.7.5
	- scikit-learn:      1.4.1.post1
	- scipy:             1.12.0
	- sentence-transformers: 2.3.1
	- sentencepiece:     0.2.0
	- setuptools:        69.1.0
	- six:               1.16.0
	- smdebug-rulesconfig: 1.0.1
	- smpppdu:           0.1.2
	- smppy:             0.3.2
	- stack-data:        0.6.3
	- sympy:             1.12
	- tblib:             2.0.0
	- tensorboard:       2.16.2
	- tensorboard-data-server: 0.7.2
	- thefuzz:           0.22.1
	- threadpoolctl:     3.3.0
	- tokenizers:        0.15.2
	- tomli:             2.0.1
	- torch:             2.2.0
	- torchmetrics:      1.3.1
	- torchvision:       0.17.0
	- tornado:           6.4
	- tqdm:              4.66.2
	- traitlets:         5.14.1
	- transformers:      4.38.0
	- triton:            2.2.0
	- typer:             0.9.0
	- typing-extensions: 4.9.0
	- urllib3:           2.0.7
	- wcwidth:           0.2.13
	- werkzeug:          3.0.1
	- wheel:             0.42.0
	- wrapt:             1.16.0
	- xmltodict:         0.13.0
	- yarl:              1.9.4
	- zipp:              3.17.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:
	- python:            3.10.8
	- release:           5.10.210-201.855.amzn2.x86_64
	- version:           #1 SMP Tue Mar 12 19:03:26 UTC 2024

More info

No response