microsoft / pai

Resource scheduling and cluster management for AI

Home Page:https://openpai.readthedocs.io

How to back up and restore user data stored by rest-server

siaimes opened this issue · comments

Short summary about the issue/question:
My certificate expired, and the cluster crashed while I was renewing it. I had to reset and reinstall the cluster, but all user data disappeared after this operation, and I could not find a backup and recovery solution on GitHub.

volumes:
- name: pai-configuration-rest-server
  configMap:
    name: pai-configuration
{% if cluster_cfg['authentication']['OIDC'] %}
- name: auth-configuration-rest-server
  configMap:
    name: auth-configuration
{% endif %}
{%- if cluster_cfg["cluster"]["common"]["cluster-type"] == "k8s" %}
{%- if cluster_cfg['hivedscheduler']['config']|length > 1 %}
- name: hived-spec-rest-server
  configMap:
    name: hivedscheduler-config
{%- endif %}
- name: k8s-exit-spec-rest-server
  configMap:
    name: k8s-job-exit-spec-configuration
{%- endif %}
- name: group-configuration-rest-server
  configMap:
    name: group-configuration
{% if cluster_cfg['cluster']['common']['k8s-rbac'] == 'true' %}
serviceAccountName: rest-server-account
{% endif %}

It seems that rest-server does not mount any data directory, only ConfigMaps, so where is its data stored? How can I back it up and restore it?

If the db file has not been deleted, you can recover the data. Here is a guide: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/troubleshooting.html#how-to-solve-the-problem
@hzy46 Can you help take a look?

User data doesn't seem to be stored there; only job data is.

After I reset and reinstalled the cluster, the job data still existed, but the user data was gone, including usernames, passwords, e-mail addresses, SSH public keys, etc.

I see that user information and group information are stored in Kubernetes Secrets, so the problem now seems to be how to back up and restore k8s Secrets.
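For anyone who wants to check what such a Secret actually holds before backing it up: Kubernetes stores every value in a Secret's `data` map base64-encoded, so it has to be decoded to be readable. The following is a minimal sketch only. The sample object, its name, its namespace, and the `username` / `email` keys are hypothetical and are not taken from the rest-server source; real objects would come from `kubectl get secret <name> -n <namespace> -o json`.

```python
import base64


def decode_secret_data(secret):
    """Decode the base64-encoded `data` map of a Kubernetes Secret.

    Assumes every value is UTF-8 text; binary values would need
    different handling.
    """
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in secret.get("data", {}).items()
    }


# Hypothetical Secret shaped like a user record. The field names are
# illustrative and may not match what rest-server really writes.
sample = {
    "kind": "Secret",
    "metadata": {"name": "example-user", "namespace": "example-namespace"},
    "data": {
        "username": base64.b64encode(b"alice").decode("ascii"),
        "email": base64.b64encode(b"alice@example.com").decode("ascii"),
    },
}

print(decode_secret_data(sample))
```

Base64 is an encoding, not encryption, so a dump of these Secrets should be treated as sensitive data.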

You are right. If you delete the etcd data files, the user/group info will be lost. We need to dump the Secrets first, then apply them to the new cluster.
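The dump-then-apply idea could be sketched roughly as below. This is not an official OpenPAI procedure. The namespace names in `ASSUMED_NAMESPACES` are an assumption and should be verified with `kubectl get namespaces` on the running cluster. The helper names are made up for illustration; the only external commands used are `kubectl get secrets -n <ns> -o json`, `kubectl create namespace`, and `kubectl apply -f`. The sanitizing step removes server-populated metadata so the objects are not tied to the old cluster.

```python
import json
import subprocess

# ASSUMPTION: namespaces holding user and group Secrets. Verify on your
# own cluster with `kubectl get namespaces` before relying on these names.
ASSUMED_NAMESPACES = ["pai-user-v2", "pai-group"]

# Metadata fields populated by the API server of the old cluster.
CLUSTER_SPECIFIC_FIELDS = (
    "resourceVersion", "uid", "creationTimestamp", "selfLink",
    "managedFields", "ownerReferences",
)


def sanitize_secret_list(dump):
    """Return a re-appliable copy of a `kubectl get secrets -o json` result."""
    items = []
    for item in dump.get("items", []):
        # Service-account tokens are regenerated by the new cluster.
        if item.get("type") == "kubernetes.io/service-account-token":
            continue
        cleaned = dict(item)
        cleaned["metadata"] = {
            key: value
            for key, value in item.get("metadata", {}).items()
            if key not in CLUSTER_SPECIFIC_FIELDS
        }
        items.append(cleaned)
    return {"apiVersion": "v1", "kind": "List", "items": items}


def backup_namespace(namespace, path):
    """Dump all Secrets of one namespace to a JSON file (run before the reset)."""
    result = subprocess.run(
        ["kubectl", "get", "secrets", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    with open(path, "w") as handle:
        json.dump(sanitize_secret_list(json.loads(result.stdout)), handle, indent=2)


def restore_namespace(namespace, path):
    """Re-create the namespace if needed and apply the dumped Secrets."""
    # Fails harmlessly if the namespace already exists.
    subprocess.run(["kubectl", "create", "namespace", namespace], check=False)
    subprocess.run(["kubectl", "apply", "-f", path], check=True)
```

Under these assumptions, one would call `backup_namespace(ns, ns + ".json")` for each entry of `ASSUMED_NAMESPACES` before the reset, keep the files somewhere off the cluster, and call `restore_namespace(ns, ns + ".json")` once the new cluster is up. Whether rest-server needs a restart to pick up the restored data is not covered here.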

So running the following command will reset the cluster, but all etcd data, including the user and group Secrets, will be lost. Please be careful.

ansible-playbook -i inventory/pai/hosts.yml -e "ansible_python_interpreter=/usr/bin/python3" reset.yml --become --become-user=root -e "@inventory/pai/openpai.yml"