microsoft / pai

Resource scheduling and cluster management for AI

Home Page: https://openpai.readthedocs.io

Job submission sometimes fails.

zsh4614 opened this issue

Organization Name:HIT

Short summary about the issue/question:
Job submission sometimes fails.
Brief what process you are following:
When I submit a job, the following error occurs:

[Exit Trigger Info]

ExitTriggerMessage: FailedTaskCount 1 has reached MinFailedTaskCount 1 in the TaskRole
ExitTriggerTaskRole: taskrole
ExitTriggerTaskIndex: 0

--------------------------------------------------------------------------------

[Exit Spec]

code: 1
phrase: PAIRuntimeExitAbnormally
issuer: PAI_RUNTIME
causer: PAI_RUNTIME
type: PLATFORM_FAILURE
stage: UNKNOWN
behavior: UNKNOWN
reaction: RETRY_TO_MAX
reason: 'PAI Runtime exit abnormally with undefined exitcode, it may have bugs'
repro:
  - PAI Runtime exits with exitcode 1
solution:
  - Contact PAI Dev to fix PAI Runtime bugs


--------------------------------------------------------------------------------

[Exit Diagnostics]

Pod failed: PodPattern unmatched:
containers:
  - name: init
    reason: Completed
    code: 0
  - name: app
    reason: Error
    message: >
      standard_init_linux.go:228: exec user process caused: no such file or
      directory
    code: 1

What is the cause of this error? I need help, thanks!

PLATFORM_FAILURE

How to reproduce it:
Submit a new job.

OpenPAI Environment:

  • OpenPAI version: v1.8.0

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release):

  • Kernel (e.g. uname -a): Linux rsgpuserver154 4.15.0-166-generic 174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

  • Hardware (e.g. core number, memory size, storage size, GPU type etc.): A40

  • Others:

Anything else we need to know:

Which Docker image are you using? OpenPAI requires bash to be included in the Docker image.
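
One way to check whether an image ships bash before submitting a job is sketched below. This is only an illustration, not part of OpenPAI: the helper names `bash_check_command` and `image_has_bash` and the image tag `my-image:latest` are hypothetical, and it assumes the `docker` CLI is installed on the machine where it runs. It overrides the image entrypoint with `/bin/bash`; if the container fails to start, the image most likely has no bash at that path, which would be consistent with the "no such file or directory" message in the diagnostics above.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def bash_check_command(image: str) -> List[str]:
    """Build a `docker run` command that exits 0 only if /bin/bash can start."""
    # --entrypoint replaces the image's own entrypoint with /bin/bash;
    # the arguments after the image name are passed to bash.
    return ["docker", "run", "--rm", "--entrypoint", "/bin/bash",
            image, "-c", "exit 0"]


def image_has_bash(
    image: str,
    run: Optional[Callable[[Sequence[str]], int]] = None,
) -> bool:
    """Return True if the check command exits with code 0.

    `run` is a hypothetical hook that executes a command and returns its
    exit code; by default it shells out to the docker CLI (and raises
    FileNotFoundError if docker itself is not installed).
    """
    if run is None:
        def run(cmd: Sequence[str]) -> int:
            return subprocess.run(cmd, capture_output=True).returncode
    return run(bash_check_command(image)) == 0


if __name__ == "__main__":
    # "my-image:latest" is a placeholder; substitute the image used by the job.
    print(image_has_bash("my-image:latest"))
```

If this reports False, rebuilding the image with bash installed, or switching to a base image that already includes it, would be the first thing to try.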