hustvl / VAD

[ICCV 2023] VAD: Vectorized Scene Representation for Efficient Autonomous Driving

Home Page: https://arxiv.org/abs/2303.12077

Errors occur in the training commands.

h-enomoto opened this issue · comments

Hello,
I executed the following command for training:
python -m torch.distributed.run --nproc_per_node=1 --master_port=2333 tools/train.py projects/configs/VAD/VAD_tiny_stage_1.py --launcher pytorch --deterministic --work-dir ./outputs
Since I only have access to one GPU, I set --nproc_per_node=1.
Subsequently, I encountered the following error:

projects.mmdet3d_plugin
Traceback (most recent call last):
  File "tools/train.py", line 266, in <module>
    main()
  File "tools/train.py", line 183, in main
    cfg.dump(osp.join(cfg.work_dir, osp.basename(args.config)))
  File "/home/user1/miniconda3/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 541, in dump
    f.write(self.pretty_text)
  File "/home/user1/miniconda3/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 496, in pretty_text
    text, _ = FormatCode(text, style_config=yapf_style, verify=True)
TypeError: FormatCode() got an unexpected keyword argument 'verify'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2555) of binary: /home/user1/miniconda3/envs/vad/bin/python
/home/user1/miniconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning: 

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 2555 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/user1/miniconda3/envs/vad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user1/miniconda3/envs/vad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user1/miniconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/home/user1/miniconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/home/user1/miniconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/home/user1/miniconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/user1/miniconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user1/miniconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
         tools/train.py FAILED         
=======================================
Root Cause:
[0]:
  time: 2023-10-16_05:13:20
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 2555)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

Could this be attributed to the parameter settings?
Your advice and guidance on this matter would be highly appreciated.
Thank you.
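
For what it's worth, the TypeError in the traceback above ("FormatCode() got an unexpected keyword argument 'verify'") usually points to a yapf/mmcv version mismatch rather than to the launch parameters: newer yapf releases dropped the verify keyword that mmcv's Config.pretty_text still passes when cfg.dump() formats the config. A minimal sketch to check that assumption in the vad environment (the exact yapf version to pin is a suggestion, not something confirmed by the VAD authors):

  # Reproduces the call mmcv's Config.pretty_text makes when cfg.dump() runs.
  # On yapf releases that removed the `verify` keyword this raises TypeError,
  # matching the traceback above.
  from yapf.yapflib.yapf_api import FormatCode

  try:
      formatted, _ = FormatCode("x = 1\n", style_config="pep8", verify=True)
      print("yapf still accepts verify=; cfg.dump() should not crash here")
  except TypeError as exc:
      print(f"yapf rejected verify=: {exc}")
      print("try pinning an older release, e.g. pip install yapf==0.40.1")

If the except branch fires, pinning yapf to a release that still accepts verify (for example yapf==0.40.1) should let cfg.dump(), and therefore the training command, proceed; --nproc_per_node=1 itself is fine for a single GPU.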