microsoft / pai

Resource scheduling and cluster management for AI

Home Page:https://openpai.readthedocs.io

Cannot receive job status change messages

HaoLiuHust opened this issue

Organization Name:

Short summary about the issue/question:

Recently I have stopped receiving job status change messages. The alert-manager error log looks like this:
2022-10-25T13:18:52.018Z [ERROR] Failed when handle job status change for job test-code-server_18304946:
meta = {
  "message": "read ECONNRESET",
  "stack": "Error: read ECONNRESET\n at TCP.onStreamRead (internal/stream_base_commons.js:111:27)",
  "config": {
    "url": "http://10.1.9.53:80/alert-manager/api/v1/alerts",
    "method": "post",
    "data": "[]",
    "headers": {
      "Accept": "application/json, text/plain, */*",
      "Content-Type": "application/json",
      "User-Agent": "axios/0.21.1",
      "Content-Length": 2
    },
    "transformRequest": [
      null
    ],
    "transformResponse": [
      null
    ],
    "timeout": 0,
    "xsrfCookieName": "XSRF-TOKEN",
    "xsrfHeaderName": "X-XSRF-TOKEN",
    "maxContentLength": -1,
    "maxBodyLength": -1
  },
  "code": "ECONNRESET"
}

Brief what process you are following:

How to reproduce it:

OpenPAI Environment:

  • OpenPAI version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.):
  • Others:

Anything else we need to know:

I fixed it by modifying the job status notification source code to skip empty alerts.
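
The patch itself is not shown in this issue, so the following is only a minimal sketch of what "skip empty alerts" could look like. The function name `sendAlerts`, its return shape, and the injected `post` parameter are all hypothetical and are not taken from the OpenPAI source. The idea is to return early when the alert list is empty, so that a request with the body `[]` (as in the log above) is never sent.

```javascript
// Hypothetical helper: the name, signature, and return value are illustrative
// assumptions, not OpenPAI's actual job status notification code.
// `post` is injected (for example a thin wrapper around axios.post) so the
// guard can be exercised without a real alert-manager endpoint.
async function sendAlerts(url, alerts, post) {
  // Nothing to report: skip the HTTP call instead of posting an empty "[]" body.
  if (!Array.isArray(alerts) || alerts.length === 0) {
    return { sent: false, count: 0 };
  }
  // Forward the non-empty alert list to the alert-manager endpoint.
  await post(url, alerts);
  return { sent: true, count: alerts.length };
}
```

With axios this might be called as `sendAlerts(url, alerts, (u, body) => axios.post(u, body))`. Injecting `post` is only a convenience for testing; the same guard could sit directly in front of an existing `axios.post` call.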