NVlabs / FB-BEV

Official PyTorch implementation of FB-BEV & FB-OCC - Forward-backward view transformation for vision-centric autonomous driving perception


Error running evaluation

jarvishou829 opened this issue


```
/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[                                                  ] 4/6019, 0.4 task/s, elapsed: 9s, ETA: 14005s/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6084/6019, 21.4 task/s, elapsed: 284s, ETA:    -2sWARNING:torch.distributed.elastic.multiprocessing.api:Sending process 795432 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 795433 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 795434 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 795435) of binary: /opt/conda/envs/fbocc/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/fbocc/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/fbocc/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 317, in run_module
    run_module_as_main(options.target, alter_argv=True)
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 238, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools_mm/test.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-30_01:11:31
  host      : fd-taxi-f-houjiawei-1691719963348-985c0990-3609275187
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 795435)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 795435
============================================================
```

The evaluation ran 6084 samples rather than the 6019 in the dataset, then exited unexpectedly.
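
For context, overshooting 6019 is normal in mmcv-style distributed testing: the sampler pads the dataset so every rank processes the same number of batches, and the progress bar counts the padded total (hence the negative ETA). A minimal sketch of that arithmetic, assuming the usual ceil-based padding; the exact rule in this repo's sampler may differ:

```python
import math

def padded_total(num_samples: int, world_size: int, samples_per_gpu: int) -> int:
    """Total sample count reported by a padding distributed sampler.

    Each rank gets ceil(num_samples / world_size) samples, rounded up again
    to a whole number of batches, so the sum over ranks can exceed the real
    dataset size. This is the common mmcv/PyTorch convention; the exact
    padding in this repo's sampler may differ (hence 6084 vs. 6019).
    """
    per_rank = math.ceil(num_samples / world_size)                       # split across GPUs
    per_rank = math.ceil(per_rank / samples_per_gpu) * samples_per_gpu   # whole batches
    return per_rank * world_size

# 6019 nuScenes val samples on 4 GPUs -> a padded total >= 6019; the bar
# overshoots the dataset size and the ETA goes negative, which is harmless.
print(padded_total(6019, world_size=4, samples_per_gpu=1))  # 6020
```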

It seems to be an out-of-memory issue: exit code -9 (SIGKILL) is typically what the kernel's OOM killer sends, and memory usage grows abnormally when the evaluation is almost finished.
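
That timing matches the usual mmcv-style `multi_gpu_test` pattern: each rank accumulates its predictions in a Python list for the entire run, and only at the very end are all partial lists gathered onto rank 0 for evaluation, which is exactly when host memory spikes. A simplified sketch of that pattern, not FB-OCC's actual code (`collect_results` is a stand-in for mmcv's `collect_results_cpu`/`collect_results_gpu`):

```python
import torch

def multi_gpu_test_sketch(model, data_loader, collect_results):
    """Sketch of the mmcv-style distributed test loop, not FB-OCC's code."""
    model.eval()
    results = []                       # grows in host RAM for the whole run
    for data in data_loader:
        with torch.no_grad():
            batch_results = model(return_loss=False, **data)
        results.extend(batch_results)  # occupancy grids are large per sample
    # The spike happens here: every rank's partial list is gathered onto
    # rank 0 (via a tmpdir or all_gather), multiplying the peak footprint.
    return collect_results(results, len(data_loader.dataset))
```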

You can try running the evaluation on a single GPU.
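
(In mmdet3d-style repos that usually means calling the test script directly instead of through `torch.distributed.launch`, e.g. something like `python tools_mm/test.py <config> <checkpoint> ...`; check the script's argument parser for the exact flags. A single process also skips the cross-rank gather, avoiding the temporary duplication of results on rank 0.)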

Hi, I get the same error. I've read that changing the batch size can fix it, but unfortunately I can't figure out how to do that. Perhaps you could try it, and if it works, tell me how? It would be greatly appreciated.
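
For reference, in mmdet/mmdet3d-style codebases like this one the batch size normally lives in the `data` section of the config file. A sketch of the usual convention (the exact file and keys here are an assumption; grep the repo's configs for `samples_per_gpu` to confirm):

```python
# Sketch of the mmdet3d config convention; not copied from FB-BEV's configs.
data = dict(
    samples_per_gpu=1,   # per-GPU batch size; test scripts often force 1 anyway
    workers_per_gpu=2,   # dataloader worker processes per GPU
)
```

Note that evaluation usually already runs with a batch size of 1, so lowering it may not help; the growth described above comes from accumulating and gathering results, not from the forward pass.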