microsoft / pai

Resource scheduling and cluster management for AI

Home Page:https://openpai.readthedocs.io

internal-storage-create is not ready yet. Please wait for a moment!

poetryben88 opened this issue

  1. When I install OpenPAI and run "/bin/bash quick-start-service.sh", the following errors occur:

..........
internal-storage-create is not ready yet. Please wait for a moment!
internal-storage-create is not ready yet. Please wait for a moment!
internal-storage-create is not ready yet. Please wait for a moment!
internal-storage-create is not ready yet. Please wait for a moment!
An issue occure when starting up internal-storage-create
2021-10-09 15:43:39,333 [ERROR] - deployment.paiLibrary.common.linux_shell : Failed to execute the start script of service internal-storage
2021-10-09 15:43:39,334 [ERROR] - deployment.paiLibrary.paiService.service_management_start : Failed to start service internal-storage
2021-10-09 15:43:39,334 [INFO] - deployment.paiLibrary.paiService.service_management_start : -----------------------------------------------------------
2021-10-09 15:43:39,334 [ERROR] - deployment.paiLibrary.paiService.service_management_start : Have retried 5 times, but service internal-storage doesn't start. Please check it.

2. Output of "kubectl get ds" and "kubectl get node --show-labels":

root@fxkj:/usr/local/lib# kubectl get ds
NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
internal-storage-create-ds   1         1         0       1            0           <none>          39h
marketplace-db-ds            1         1         1       1            1           <none>          39h
root@fxkj:/usr/local/lib#

Note: internal-storage-create-ds is not ready (READY is 0).
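
For a cluster with more DaemonSets, the not-ready ones can be picked out mechanically instead of by eye. This is only a minimal sketch, assuming the default column order of `kubectl get ds` (NAME, DESIRED, CURRENT, READY, ...); the function name `not_ready_ds` is made up for illustration.

```shell
# Print the name of every DaemonSet whose READY count is below DESIRED.
# Reads rows in the format of `kubectl get ds --no-headers` from stdin:
# column 1 = NAME, column 2 = DESIRED, column 4 = READY.
not_ready_ds() {
    awk '$4 < $2 { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get ds --no-headers | not_ready_ds
```

Reading from stdin keeps the filter independent of kubectl, so it can be tried on pasted output as well.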

root@fxkj:/usr/local/lib# kubectl get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
pai-master Ready master 40h v1.15.11 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=pai-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=,pai-master=true,pai-worker=false
pai-worker1 Ready <none> 40h v1.15.11 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=pai-worker1,kubernetes.io/os=linux,pai-worker=true
root@fxkj:/usr/local/lib#

root@fxkj:/usr/local/lib# kubectl get pods
NAME                                    READY   STATUS    RESTARTS   AGE
frameworkcontroller-sts-0               1/1     Running   0          38h
internal-storage-create-ds-xhsmj        0/1     Running   0          38h
marketplace-db-ds-lp27s                 1/1     Running   0          38h
prometheus-deployment-7d87d7c8d-xnz9n   1/1     Running   0          38h
root@fxkj:/usr/local/lib# kubectl logs internal-storage-create-ds-xhsmj
Creating storage.ext4 of 30G, please wait...
mount: /paiInternal/storage.ext4: failed to setup loop device: No such file or directory
mount failed!
root@fxkj:/usr/local/lib#

  3. It seems /paiInternal/storage.ext4 was not created or could not be mounted. How do I solve this? Please help.

The root filesystem ("/") still has more than 100G available, so free space should not be the problem.
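
Since mount complains about setting up a loop device rather than about disk space, one thing worth checking is whether loop device nodes are visible at all. This is a rough diagnostic sketch, not a confirmed root cause: it assumes the usual Linux layout (`/dev/loop-control` plus `/dev/loopN` nodes), and `loop_nodes_status` is a hypothetical helper name.

```shell
# Report whether the basic loop-device nodes exist under a /dev-like directory.
# The directory defaults to /dev; it is a parameter only so that the check
# can also be tried against a scratch directory.
loop_nodes_status() {
    dev_dir="${1:-/dev}"
    if [ ! -e "$dev_dir/loop-control" ]; then
        echo "missing loop-control"
    elif ! ls "$dev_dir"/loop[0-9]* >/dev/null 2>&1; then
        echo "no loopN nodes"
    else
        echo "ok"
    fi
}

# On the node or inside the container (as root), for comparison:
#   loop_nodes_status        # inspect /dev
#   losetup -f               # print the first unused loop device
#   lsmod | grep -w loop     # is the loop driver loaded as a module?
```

If the helper reports "ok" on the host but not inside the container, the container may simply not see the host's loop devices; that is a guess to verify, not a diagnosis.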

  4. dmesg

root@fxkj:/usr/local/lib# dmesg | tail
[2120546.939680] audit: type=1400 audit(1633622410.263:101): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=1249859 comm="cups-browsed" capability=23 capname="sys_nice"
[2188391.029848] IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
[2188391.029872] IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
[2188391.030155] IPVS: ipvs loaded.
[2188391.310905] IPVS: [rr] scheduler registered.
[2188391.563727] IPVS: [wrr] scheduler registered.
[2188392.797697] IPVS: [sh] scheduler registered.
[2188435.838861] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to linux-mm@kvack.org if you depend on this functionality.
[2228600.981557] audit: type=1400 audit(1633730462.386:102): apparmor="DENIED" operation="open" profile="/usr/sbin/ntpd" name="/snap/bin/" pid=2239327 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[2379756.841111] audit: type=1400 audit(1633881615.580:103): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=2681473 comm="cups-browsed" capability=23 capname="sys_nice"
root@fxkj:/usr/local/lib#

  5. fdisk -l

Does the disk have to be ext4? One of my disks uses the default filesystem; the other is xfs (on LVM).

Disk /dev/nvme0n1: 476.96 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: INTEL SSDPEKNW512G8
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: C5F78941-C283-4717-9D9F-786F4CCD19BB

Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 1050623 1048576 512M EFI System
/dev/nvme0n1p2 1050624 1000214527 999163904 476.4G Linux filesystem

Disk /dev/sda: 3.65 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: ST4000NM000A-2HZ
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 615599AB-DB0C-49CE-996B-7278486E53B8

Device Start End Sectors Size Type
/dev/sda1 34 7814037134 7814037101 3.7T Linux LVM

Disk /dev/mapper/ubuntu-opt: 3.65 TiB, 4000783007744 bytes, 7814029312 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

  6. I found a workaround that clears the error, but OpenPAI data will probably no longer be saved in storage.ext4.
    Maybe the volume of the Docker container "openpai/internal-storage-create" is not created correctly.
    I entered the "openpai/internal-storage-create" container and manually ran "touch storage/READY" in /paiInternal. After that the error went away, as shown below:

internal-storage-create is not ready yet. Please wait for a moment!
internal-storage-create is not ready yet. Please wait for a moment!
internal-storage-create is not ready yet. Please wait for a moment!
internal-storage-create is not ready yet. Please wait for a moment!
internal-storage-create is not ready yet. Please wait for a moment!
internal-storage-create is ready!
2021-10-11 09:17:26,755 [INFO] - deployment.paiLibrary.paiService.service_management_start : Begin to clean all service's generated template file
2021-10-11 09:17:26,755 [INFO] - deployment.paiLibrary.paiService.service_template_clean : Begin to delete the generated template of internal-storage's service.
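
Because touching READY by hand skips the mount step that failed, it may be worth confirming whether anything is actually mounted before relying on this workaround. This is a minimal sketch, assuming the image is meant to be mounted at /paiInternal/storage (inferred from the location of the READY file, not confirmed); `is_mount_point` is a hypothetical helper that scans a file in /proc/mounts format.

```shell
# Succeed (exit 0) if TARGET appears as a mount point in a /proc/mounts-style
# file, fail (exit 1) otherwise. Field 2 of each row is the mount point.
is_mount_point() {
    target="$1"
    mounts_file="${2:-/proc/mounts}"
    awk -v t="$target" '$2 == t { found = 1 } END { exit !found }' "$mounts_file"
}

# Inside the container (the mount path is an assumption):
#   is_mount_point /paiInternal/storage && echo "mounted" || echo "not mounted"
```

If the path turns out not to be a mount point, data written there presumably lands in the plain directory rather than in storage.ext4.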