microsoft / pai

Resource scheduling and cluster management for AI

Home Page: https://openpai.readthedocs.io

I followed the documentation to update the certificate and the cluster crashed.

siaimes opened this issue

Organization Name:

Short summary about the issue/question:

DOC: https://github.com/microsoft/pai/blob/master/docs/manual/cluster-admin/how-to-renew-k8s-cert.md

The root of the issue lies in this line of code:

ansible-playbook -i hosts.yml --limit '!master-node' --become --become-user root renew-worker-cert.yaml

image

As shown in the figure, the master node belongs to the kube-master group, so it should be excluded with !kube-master rather than !master-node. With !master-node the master is not excluded, so it renews its certificate as if it were a worker node, and the cluster crashes.

So this line should be changed to:

ansible-playbook -i hosts.yml --limit '!kube-master' --become --become-user root renew-worker-cert.yaml
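
Before running the playbook for real, the pattern can be checked with ansible-playbook's --list-hosts option, which prints the hosts selected by the inventory and --limit without executing any tasks. This is a minimal sketch, not part of the official doc: the preview_limit helper name is hypothetical, and it assumes hosts.yml and renew-worker-cert.yaml are in the current directory, as in the commands above.

```shell
# Sketch: preview which hosts a --limit pattern selects, without running tasks.
# preview_limit is a hypothetical helper name; hosts.yml and
# renew-worker-cert.yaml are assumed to be in the current directory.
preview_limit() {
  # $1 is the limit pattern to check, e.g. '!kube-master'
  ansible-playbook -i hosts.yml --limit "$1" --list-hosts renew-worker-cert.yaml
}

# Example (run manually on the dev box):
#   preview_limit '!kube-master'
```

If the master node still appears in the printed host list, the pattern does not exclude it and the renew command should not be run with that pattern.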

Other minor issues:

Currently the etcd of the OpenPAI cluster does not seem to use a certificate, so there is no need to run the etcd-related commands.

Brief what process you are following:

How to reproduce it:

OpenPAI Environment:

  • OpenPAI version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.):
  • Others:

Anything else we need to know:

My one-command solution for this doc:

https://github.com/siaimes/renew-k8s-certs

Thanks for this. There is also an option to rotate certificates automatically; please refer to https://kubernetes.io/docs/tasks/tls/certificate-rotation/. We have an issue tracking this: #5439