microsoft / pai

Resource scheduling and cluster management for AI

Home Page: https://openpai.readthedocs.io


Unable to retrieve log for submitted jobs

chinkit-ffc opened this issue · comments

Short summary about the issue/question:
The job details page cannot retrieve logs:
image

Console Tab log:
image

Network Tab log:
image

Brief what process you are following:
OpenPAI had been functioning well for months until we noticed last week (29 Dec 2021) that logs of submitted jobs could no longer be retrieved on the job details page.

We checked the server.log file in the rest-server container; here is the content:
image

The master node host machine and other PCs on the LAN can successfully ping the worker node IP address and call the log manager healthz API on port 9103, but neither works from within the rest-server container. The rest-server container has no internet access either. Below is the content of /etc/resolv.conf inside the rest-server container:
image
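
For anyone reproducing the reachability check described above from both the host and the rest-server container, a minimal Python sketch could look like the following. It only tests whether a TCP connection to the log manager port can be opened; the function name `can_connect` is an illustrative choice, and it assumes a Python interpreter is available in the environment where it is run.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, DNS failures and timeouts.
        return False
```

Calling `can_connect("<worker node IP>", 9103)` once on the host and once inside the container shows whether the failure is specific to the pod network. A `False` result only inside the container points at pod networking rather than at the log manager itself.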

We tried changing the rest-server.yaml.template file to use the host network (hostNetwork and hostPID set to true, container port set to 9186, i.e. the same as the server port) and restarted the rest-server service. After that, the rest-server container could ping the worker node IP address and call the log manager healthz API. At the same time, however, the login function, the job list, and the service list in the administration tab of the web portal stopped working. We also tried switching the webportal service to the host network, but then the webportal service could not start and stayed stuck at the "webportal service not yet ready" message.

We also tried configuring the pylon service to use the host network, but log retrieval still failed with the same error shown above.

Any help or suggestion to further diagnose and resolve this issue would be greatly appreciated.

OpenPAI Environment:

  • OpenPAI version: v1.7.0
  • Cloud provider or hardware configuration: 1 dev-box machine, 1 master node and 1 worker node
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.2 LTS
  • Kernel (e.g. uname -a): Linux paimaster 5.11.0-43-generic #47~20.04.2-Ubuntu SMP Mon Dec 13 11:06:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.): 12 cores, 32GB ram, 500GB SSD on both master and worker node, GTX2060 on worker node.

So you cannot ping the worker node from inside the rest-server container. Can you try changing /etc/resolv.conf inside the rest-server container to match the content of the file on the host? I just want to make sure it's not a DNS issue.
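
To compare the two resolv.conf files before editing anything, a small sketch like the one below may help. It extracts only the `nameserver` entries from the text of each file; the function names are hypothetical, and the file contents have to be collected separately from the host and from the container.

```python
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf-style text, in order."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        # Drop comments, which start with '#' or ';'.
        line = raw.split("#", 1)[0].split(";", 1)[0]
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def dns_differs(container_text: str, host_text: str) -> bool:
    """Return True if the container and the host list different nameservers."""
    return nameservers(container_text) != nameservers(host_text)
```

A difference is not necessarily a fault: pods normally point at the cluster DNS service rather than at the host's resolver, so this comparison only tells you what to try overriding.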

Also, which CNI do you use, weave or calico? Are the CNI pods working well? You can run kubectl get po -n kube-system to check whether all pods are running correctly. Then try to get the CNI pod log and check whether it contains any errors.
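
If the pod list is long, the output of `kubectl get po -n kube-system` can be filtered with a short script. The sketch below assumes the default table output with `NAME`, `READY` and `STATUS` columns and treats a pod as unhealthy when its status is neither `Running` nor `Completed`, or when it is `Running` but not all of its containers are ready. The function name `unhealthy_pods` is illustrative.

```python
from typing import List, Tuple

HEALTHY_STATUSES = {"Running", "Completed"}


def unhealthy_pods(kubectl_output: str) -> List[Tuple[str, str, str]]:
    """Return (name, ready, status) for pods that look unhealthy in `kubectl get po` output."""
    rows = [line.split() for line in kubectl_output.splitlines() if line.strip()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # Locate columns by header name; these three precede RESTARTS, so a
    # multi-word RESTARTS value does not shift them.
    name_i, ready_i, status_i = (header.index(c) for c in ("NAME", "READY", "STATUS"))
    bad = []
    for row in body:
        name, ready, status = row[name_i], row[ready_i], row[status_i]
        ready_now, _, ready_want = ready.partition("/")
        not_ready = status == "Running" and ready_now != ready_want
        if status not in HEALTHY_STATUSES or not_ready:
            bad.append((name, ready, status))
    return bad
```

Feeding it the captured command output returns only the pods worth inspecting further, for example a CNI pod stuck in a restart loop.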

Thanks a lot @Binyang2014. Appreciate your help. The issue has been resolved.

Apparently the calico-node pod was not running and kept restarting. I followed the suggestion here to set IP_AUTODETECTION_METHOD, after which calico-node started normally and logs can be retrieved properly on the job details page. I am still not sure why this issue appeared all of a sudden, though.
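
For readers hitting the same problem, one way to set that variable is with `kubectl set env` on the calico-node daemonset. The sketch below only builds the command line; it assumes the daemonset is named `calico-node` and lives in the `kube-system` namespace, and the value `interface=eth.*` in the usage note is a hypothetical example, not the value used in this thread. The helper name `autodetect_command` is also illustrative.

```python
from typing import List


def autodetect_command(method: str, namespace: str = "kube-system") -> List[str]:
    """Build a kubectl command that sets IP_AUTODETECTION_METHOD on calico-node.

    `method` is passed through unchanged; check the Calico documentation for
    the values supported by your Calico version.
    """
    if not method.strip():
        raise ValueError("method must not be empty")
    return [
        "kubectl", "set", "env",
        "daemonset/calico-node",  # assumed daemonset name
        "-n", namespace,
        f"IP_AUTODETECTION_METHOD={method}",
    ]
```

For example, `autodetect_command("interface=eth.*")` yields a list that can be passed to `subprocess.run` on a machine with cluster access. Which detection method is appropriate depends on the node's network interfaces.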