microsoft / pai

Resource scheduling and cluster management for AI

Home Page: https://openpai.readthedocs.io

[bug report] When a task was cloned, the TensorBoard port was not regenerated, so the TensorBoard could not be started.

siaimes opened this issue

Organization Name:

Short summary about the issue/question:
When a task was cloned, the TensorBoard port was not regenerated, so the TensorBoard could not be started.
Brief what process you are following:
Clone a job.

How to reproduce it:
Submit a job with TensorBoard enabled, then clone it. The TensorBoard section of the YAML configuration is exactly the same in both jobs, including the port, so the cloned job fails to start TensorBoard.

OpenPAI Environment:

  • OpenPAI version: v1.8.0
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.):
  • Others:

Anything else we need to know:

Instead of hard-coding the port number in the YAML file, we should pick an available port the same way the SSH function does and report it back to the frontend.
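
To illustrate the idea, here is a minimal sketch, not OpenPAI code. It assumes a simplified job config held as a Python dict with a hypothetical `tensorboard` section containing `port` and `logdir` keys; the real job YAML is laid out differently. The helper names `pick_free_port` and `clone_job_config` are also made up for this example. The sketch asks the operating system for an unused port by binding to port 0, and assigns that port to the clone instead of copying the original one.

```python
import copy
import socket


def pick_free_port(host: str = "") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (address, port) for an AF_INET socket.
        return sock.getsockname()[1]


def clone_job_config(config: dict) -> dict:
    """Deep-copy a job config and give the clone a fresh TensorBoard port.

    The "tensorboard" key and its layout are hypothetical, chosen only
    to keep the example short.
    """
    cloned = copy.deepcopy(config)
    tensorboard = cloned.get("tensorboard")
    if tensorboard is not None:
        tensorboard["port"] = pick_free_port()
    return cloned


# Hypothetical original job with a fixed TensorBoard port.
original = {
    "name": "job-a",
    "tensorboard": {"port": 6006, "logdir": "/mnt/tensorboard"},
}
clone = clone_job_config(original)
print(original["tensorboard"]["port"], clone["tensorboard"]["port"])
```

Note that the port is only known to be free on the machine that runs this code, at the moment of the call. In a cluster the selection would have to happen on the node where the job is scheduled, and another process could still claim the port before TensorBoard binds to it.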