RuntimeError: Failed to make directory for profiling
AlbertZhangHIT opened this issue
When profiling NPUs in a multi-machine scenario, the profiler fails to create the directory for storing the tracing data.
Environment:
OS: ubuntu 20.04
Arch: aarch64
Python: 3.10
torch: 2.1.0
torch-npu: 2.1.0
Snippet:
with torch_npu.profiler.profile(
    activities=[torch_npu.profiler.ProfilerActivity.CPU, torch_npu.profiler.ProfilerActivity.NPU],
    schedule=torch_npu.profiler.schedule(wait=1, warmup=2, active=5, skip_first=100),
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(dir_name=os.path.join(self.args.output_ckpt_path, "profiling")),
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
    experimental_config=torch_npu.profiler._ExperimentalConfig(
        profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
        aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization,
        l2_cache=False,
        data_simplification=False),
) as profiler:
Errors:
2024-02-23 12:31:32 [WARNING] [332] profiler.py: Incorrect schedule: WARMUP followed by NONE
Traceback (most recent call last):
File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/utils/path_manager.py", line 134, in make_dir_safety
os.makedirs(path, mode=cls.DATA_DIR_AUTHORITY)
File "/home/HwHiAiUser/anaconda3/lib/python3.10/os.py", line 225, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/job/output/profiling/Euler_332_20240223123131.626_ascend_pt'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/job/file/pretrain_profiling.py", line 256, in train
profiler.step()
File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler.py", line 79, in step
self._action_controller.transit_action()
File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler_action_controller.py", line 90, in transit_action
action()
File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler_action_controller.py", line 96, in init
path = self._on_trace_ready.create_prof_dir()
File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler_action_controller.py", line 36, in create_prof_dir
PathManager.make_dir_safety(total_path)
File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/utils/path_manager.py", line 136, in make_dir_safety
raise RuntimeError(msg) from err
RuntimeError: Failed to make directory: /job/output/profiling/Euler_332_20240223123131.626_ascend_pt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/job/file/pretrain_profiling.py", line 354, in <module>
main()
File "/job/file/pretrain_profiling.py", line 350, in main
train_and_validate(args)
File "/job/file/pretrain_profiling.py", line 152, in train_and_validate
trainer.run(args.epochs)
File "/job/file/pretrain_profiling.py", line 186, in run
self.train()
File "/job/file/pretrain_profiling.py", line 189, in train
with torch_npu.profiler.profile(
File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler.py", line 70, in __exit__
self._action_controller.transit_action()
File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler_action_controller.py", line 90, in transit_action
action()
File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler_action_controller.py", line 103, in start_prof
self._msprofiler_interface.start_profiler()
File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/msprofiler_c_interface.py", line 51, in start_profiler
torch_npu._C._profiler._start_profiler(self.msprof_config, self.activities)
TypeError: _start_profiler(): incompatible function arguments. The following argument types are supported:
1. (config: torch_npu._C._profiler.NpuProfilerConfig, activities: Set[torch_npu._C._profiler.ProfilerActivity], scopes: Set[torch._C._profiler.RecordScope] = set()) -> None
Strangely, if I set skip_first to 0, the error disappears.
I also found what may be a bug in the directory creation here. The function make_dir_safety may not be safe, especially in the multi-threaded case, because the os.makedirs call raises FileExistsError if another worker has already created the same path. We should at least pass exist_ok=True to os.makedirs to avoid this kind of error.
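The exist_ok=True suggestion can be sketched as follows. This is a minimal illustration, not the actual torch_npu code: make_dir_tolerant and create_concurrently are hypothetical names, and the 0o750 mode is an assumed default rather than the real value of DATA_DIR_AUTHORITY.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def make_dir_tolerant(path: str, mode: int = 0o750) -> None:
    """Hypothetical helper (not torch_npu's make_dir_safety).

    With exist_ok=True, os.makedirs does not raise FileExistsError
    when the directory already exists, so concurrent callers that
    target the same path can all succeed.
    """
    os.makedirs(path, mode=mode, exist_ok=True)


def create_concurrently(path: str, workers: int = 8) -> bool:
    """Create the same directory from several threads at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(make_dir_tolerant, path) for _ in range(workers)]
        for future in futures:
            future.result()  # re-raises any exception from a worker thread
    return os.path.isdir(path)
```

Note that exist_ok=True only tolerates an existing directory. If the path exists as a regular file, os.makedirs still raises FileExistsError, so that case would still need its own handling.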
You can pass a worker_name to torch_npu.profiler.tensorboard_trace_handler so that each rank writes to its own directory:
on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(dir_name=os.path.join(self.args.output_ckpt_path, "profiling"), worker_name="rank_"+str(torch.distributed.get_rank()))
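Judging by the directory name in the traceback (Euler_332_...), the default worker name seems to combine the host name and the process ID, which could coincide across machines that share an output path. A minimal sketch of building a name that is distinct per rank is below. profiling_worker_name is a hypothetical helper, and reading the rank from the RANK environment variable assumes a launcher such as torchrun that sets it.

```python
import os
import socket
from typing import Optional


def profiling_worker_name(rank: Optional[int] = None) -> str:
    """Hypothetical helper: build a per-rank worker name.

    If rank is not given, fall back to the RANK environment variable
    (an assumption about the launcher), defaulting to 0.
    """
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    # Include the host name for readability; the global rank alone
    # is what makes the name unique across machines.
    return f"{socket.gethostname()}_rank_{rank}"
```

The result can then be passed as worker_name to tensorboard_trace_handler in place of the inline "rank_" + str(...) expression.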