canonical / microceph

Ceph for a one-rack cluster and appliances

Home Page: https://snapcraft.io/microceph


Missing public_network config key after upgrade

sabaini opened this issue

Issue report

What version of MicroCeph are you using?

Multiple versions

What are the steps to reproduce this issue?

  1. Install and bootstrap
  2. Upgrade twice (here: quincy/stable to reef/stable, then to reef/beta)

What happens (observed behaviour)?

After the second upgrade, the MON fails to come up.

What were you expecting to happen?

All services start after each upgrade.

Relevant logs, error output, etc.

Detailed steps

# Install and bootstrap

root@ic-2:~# snap install microceph --channel quincy/stable
microceph (quincy/stable) 0+git.4a608fc from Canonical✓ installed
root@ic-2:~# microceph cluster bootstrap

# Observe: we have fsid and keyring.client.admin as conf keys

root@ic-2:~# sudo microceph cluster sql "select * from config"
+----+----------------------+------------------------------------------+
| id |         key          |                  value                   |
+----+----------------------+------------------------------------------+
| 1  | fsid                 | de5fc10b-f2d2-4653-b69a-228173419b30     |
| 2  | keyring.client.admin | AQAQDN5l6MM9HhAA8+lJViDNRBeIK2vZMdnvJw== |
+----+----------------------+------------------------------------------+

# Upgrade

root@ic-2:~# snap refresh microceph --channel reef/stable
2024-02-27T16:21:58Z INFO Waiting for "snap.microceph.mds.service" to stop.
microceph (reef/stable) 18.2.0+snap3f8909a69a from Canonical✓ refreshed

# Observe: conf keys are unchanged

root@ic-2:~# sudo microceph cluster sql "select * from config"
+----+----------------------+------------------------------------------+
| id |         key          |                  value                   |
+----+----------------------+------------------------------------------+
| 1  | fsid                 | de5fc10b-f2d2-4653-b69a-228173419b30     |
| 2  | keyring.client.admin | AQAQDN5l6MM9HhAA8+lJViDNRBeIK2vZMdnvJw== |
+----+----------------------+------------------------------------------+

# Observe: current versions of the snap on the system

root@ic-2:~# ls -la /var/snap/microceph
total 20
drwxr-xr-x 5 root root 4096 Feb 27 16:22 .
drwxr-xr-x 7 root root 4096 Feb 27 16:21 ..
drwxr-xr-x 4 root root 4096 Feb 27 16:21 793
drwxr-xr-x 4 root root 4096 Feb 27 16:21 862
drwxr-xr-x 5 root root 4096 Feb 27 16:21 common
lrwxrwxrwx 1 root root    3 Feb 27 16:22 current -> 862

# Observe: run dir still refers to the previous version

root@ic-2:~# grep run /var/snap/microceph/current/conf/ceph.conf 
run dir = /var/snap/microceph/793/run


# Upgrade a second time

root@ic-2:~# snap refresh microceph --channel reef/beta
2024-02-27T16:24:20Z INFO Waiting for "snap.microceph.mds.service" to stop.
microceph (reef/candidate) 18.2.0+snapa383115101-dirty from Canonical✓ refreshed
Channel reef/beta for microceph is closed; temporarily forwarding to reef/candidate.

# Observe: the original snap revision (793) has been rolled over and its data dir is gone
root@ic-2:~# ls -la /var/snap/microceph
total 20
drwxr-xr-x 5 root root 4096 Feb 27 16:24 .
drwxr-xr-x 7 root root 4096 Feb 27 16:21 ..
drwxr-xr-x 4 root root 4096 Feb 27 16:21 862
drwxr-xr-x 4 root root 4096 Feb 27 16:21 899
drwxr-xr-x 5 root root 4096 Feb 27 16:21 common
lrwxrwxrwx 1 root root    3 Feb 27 16:24 current -> 899

# Observe: ceph.conf still refers to original run dir
root@ic-2:~# grep run /var/snap/microceph/current/conf/ceph.conf 
run dir = /var/snap/microceph/793/run


# Observe: MON fails to come up, as run dir doesn't exist
root@ic-2:~# less /var/log/syslog
...
Feb 27 16:24:29 ic-2 microceph.mon[19273]:   what():  filesystem error: cannot set permissions: No such file or directory [/var/snap/microceph/793/run]
Feb 27 16:24:29 ic-2 microceph.mon[19273]: *** Caught signal (Aborted) **
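
The stale run dir seen above can be detected before the MON aborts. The following is a minimal sketch, not part of MicroCeph: `check_run_dir` is a hypothetical helper name, and it assumes ceph.conf contains a single `run dir = <path>` line as in the grep output above.

```shell
# Hypothetical helper (not shipped with MicroCeph): report whether the
# "run dir" configured in a given ceph.conf exists on disk.
check_run_dir() {
    conf="$1"
    # Take the first "run dir = <path>" line and strip surrounding whitespace.
    run_dir=$(sed -n '/^[[:space:]]*run dir[[:space:]]*=/{s/^[^=]*=[[:space:]]*//;s/[[:space:]]*$//;p;q;}' "$conf")
    if [ -z "$run_dir" ]; then
        echo "no run dir set in $conf"
        return 2
    fi
    if [ -d "$run_dir" ]; then
        echo "ok: $run_dir"
        return 0
    fi
    # The path points at a directory that no longer exists,
    # e.g. the data dir of a snap revision that was rolled over.
    echo "stale: $run_dir"
    return 1
}
```

On the system above it would be called as `check_run_dir /var/snap/microceph/current/conf/ceph.conf`; a non-zero exit status means the MON is likely to fail on its next start.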

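The root cause is that ceph.conf is rewritten on every startup, and since at least 287ee68 that rewrite expects a public_network item in the cluster config. A cluster bootstrapped from an older snap, like the one above, only has the fsid and keyring.client.admin keys.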

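Without this config key, the ceph.conf rewrite on startup fails silently, so the file keeps pointing at the run dir of the revision it was last written for (793 in the example above).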
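
To check whether a node is affected, one can look for the key in the cluster config. This is only a sketch: `config_has_key` is a hypothetical helper, and it assumes the ASCII table layout printed by `microceph cluster sql "select * from config"` in the transcript above (key in the second column).

```shell
# Hypothetical helper: read the table printed by
#   microceph cluster sql "select * from config"
# on stdin and succeed only if a row with the given key is present.
config_has_key() {
    key="$1"
    awk -F'|' -v k="$key" '
        {
            # With "|" as separator, the key name is the third field.
            name = $3
            gsub(/^[ \t]+|[ \t]+$/, "", name)
        }
        name == k { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage would look like `microceph cluster sql "select * from config" | config_has_key public_network || echo "public_network missing"`. Border and header rows are ignored because their third field never equals a real key name.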

We didn't notice the issue in our upgrade testing because snapd keeps the data dir of the previous snap revision around after a single refresh, so the MON happily keeps using the stale run dir there. Only a second refresh removes that directory and exposes the problem.

I'm working on a backward-compat shim for config key/value pairs.