canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization

Home Page: https://cloud-init.io/

22.4.2 on Deb 12: sometimes `apt-get update` exits with error 100

corradofiore opened this issue · comments

Bug report

We are observing the following error on some of our VMs in a Managed Instance Group on Google Cloud Platform:

myuser@deb12:~$ cloud-init status --long
status: error
boot_status_code: enabled-by-generator
last_update: Mon, 11 Mar 2024 11:31:05 +0000
detail:
('apt-configure', ProcessExecutionError("Unexpected error while running command.\nCommand: ['eatmydata', 'apt-get', '--option=Dpkg::Options::=--force-confold', '--option=Dpkg::options::=--force-unsafe-io', '--assume-yes', '--quiet', 'update']\nExit code: 100\nReason: -\nStdout: -\nStderr: -"))

Other instances work fine: on those, cloud-init status returns done.

Steps to reproduce the problem

Here is an excerpt of our cloud-init recipe:

## template: jinja
#cloud-config
#
# This template requires a Debian 12 image and systemd-networkd-wait-online.service enabled
#
merge_how:
  - name: list
    settings: [append]
  - name: dict
    settings: [no_replace, recurse_list]
apt:
  sources:
    sury-php:
      source: "deb [signed-by=$KEY_FILE] https://packages.sury.org/php/ bookworm main"
      keyid: 1505 8500 A023 5D97 F5D1 0063 B188 E2B6 95BD 4743
      keyserver: keyserver.ubuntu.com
    proxysql:
      source: "deb [signed-by=$KEY_FILE] https://repo.proxysql.com/ProxySQL/proxysql-2.5.x/bookworm/ ./"
      keyid: 653F 85BB 3825 6DF8 A962 06C3 E8CA 2E8D 8217 C97E
      keyserver: keyserver.ubuntu.com
    zabbix:
      source: "deb [signed-by=$KEY_FILE] https://repo.zabbix.com/zabbix/6.0/debian bookworm main"
      key: |
        -----BEGIN PGP PUBLIC KEY BLOCK-----
        Version: GnuPG v1.4.10 (GNU/Linux)
        
        mQGiBFCNJaYRBAC4nIW8o2NyOIswb82Xn3AYSMUcNZuKB2fMtpu0WxSXIRiX2BwC
        YXx8cIEQVYtLRBL5o0JdmoNCjW6jd5fOVem3EmOcPksvzzRWonIgFHf4EI2n1KJc
        JXX/nDC+eoh5xW35mRNFN/BEJHxxiRGGbp2MCnApwgrZLhOujaCGAwavGwCgiG4D
        wKMZ4xX6Y2Gv3MSuzMIT0bcEAKYn3WohS+udp0yC3FHDj+oxfuHpklu1xuI3y6ha
        402aEFahNi3wr316ukgdPAYLbpz76ivoouTJ/U2MqbNLjAspDvlnHXXyqPM5GC6K
        jtXPqNrRMUCrwisoAhorGUg/+S5pyXwsWcJ6EKmA80pR9HO+TbsELE5bGe/oc238
        t/2oBAC3zcQ46wPvXpMCNFb+ED71qDOlnDYaaAPbjgkvnp+WN6nZFFyevjx180Kw
        qWOLnlNP6JOuFW27MP75MDPDpbAAOVENp6qnuW9dxXTN80YpPLKUxrQS8vWPnzkY
        WtUfF75pEOACFVTgXIqEgW0E6oww2HJi9zF5fS8IlFHJztNYtbQgWmFiYml4IFNJ
        QSA8cGFja2FnZXJAemFiYml4LmNvbT6IYAQTEQIAIAUCUI0lpgIbAwYLCQgHAwIE
        FQIIAwQWAgMBAh4BAheAAAoJENE9WOR56l7UhUwAmgIGZ39U6D2w2oIWDD8m7KV3
        oI06AJ9EnOxMMlxEjTkt9lEvGhEX1bEh7bkBDQRQjSWmEAQAqx+ecOzBbhqMq5hU
        l39cJ6l4aocz6EZ9mSSoF/g+HFz6WYnPAfRaYyfLmZdtF5rGBDD4ysalYG5yD59R
        Mv5tNVf/CEx+JAPMhp6JCBkGRaH+xHws4eBPGkea4rGNVP3L3rA7g+c1YXZICGRI
        OOH7CIzIZ/w6aFGsPp7xM35ogncAAwUD/3s8Nc1OLDy81DC6rGpxfEURd5pvd/j0
        D5Di0WSBEcHXp5nThDz6ro/Vr0/FVIBtT97tmBHX27yBS3PqxxNRIjZ0GSWQqdws
        Q8o3YT+RHjBugXn8CzTOvIn+2QNMA8EtGIZPpCblJv8q6MFPi9m7avQxguMqufgg
        fAk7377Rt9RqiEkEGBECAAkFAlCNJaYCGwwACgkQ0T1Y5HnqXtQx4wCfcJZINKVq
        kQIoV3KTQAIzr6IvbZoAn12XXt4GP89xHuzPDZ86YJVAgnfKmQENBFeIdv0BCADA
        zkjO9jHoDRfpJt8XgfsBS8FpANfHF2L29ntRwd8ocDwxXSbtBuGIkUSkOPUTx6i/
        e9hd8vYh4mcX3yYpiW8Sui4aXbJu9uuSdU5KvPOaTsFeit9jBDK4b0baFYBDpcBB
        rgQuyviMAVAczu5qlwolA/Vu6DWqah1X9p+4EFa1QitxkhYs3br2ZGy7FZA3f2sZ
        aVhHAPAOBSuQ1W6tiUfTIj/Oc7N+FBjmh3VNfIvMBa0E3rA2JlObxUEywsgGo7FP
        WnwjZyv883slHp/I3H4Or9VBouTWA2yICeROmMwjr4mOZtJTz9e4v/a2cG/mJXgx
        Ce+FjBvTvrgOVHAXaNwLABEBAAG0IFphYmJpeCBMTEMgPHBhY2thZ2VyQHphYmJp
        eC5jb20+iQE4BBMBAgAiBQJXiHb9AhsDBgsJCAcDAgYVCAIJCgsEFgIDAQIeAQIX
        gAAKCRAIKrVroU/lkbO8B/4/MhxoUN2RPmH7BzFGIntKEWAwbRkDzyQOk9TjXVeg
        fsBnzmDSdowh7gyteVauvr62jiVtowlE/95vbXqbBCISLqKGi9Wmbrj7lUXBd2sP
        7eApFzMUhb3G3GuV5pCnRBIzerDfhXiLE9EWRN89JYDxwCLYctQHieZtdmlnPyCb
        FF6wcXTHUEHBPqdTa6hvUqQL2lHLFoduqQz4Q47Cz7tZxnbrakAewEToPcjMoteC
        SfXwF/BRxSUDlN7tKFfBpYQawS8ZtN09ImHOO6CZ/pA0qQimiNiRUfA25onIDWLL
        Y/NMWg+gK94NVVZ7KmFG3upDB5/uefK6Xwu2PsgiXSQguQENBFeIdv0BCACZgfqg
        z5YoX+ujVlw1gX1J+ygf10QsUM9GglLEuDiSS/Aa3C2UbgEa+N7JuvzZigGFCvxt
        AzaerMMDzbliTqtMGJOTjWEVGxWQ3LiY6+NWgmV46AdXik7sUXM155f1vhOzYp6E
        Zj/xtGvyUzTLUkAlnZNrhEUbUmOhDLassVi32hIyMR5W7w6IIi0zIM1mSuLR0H6o
        DEpR3GzuGVHGj4/sLeAg7iY5MziGwySBQk0Dg0xH5YqHb+uKzCTH/ILu3srPJq+2
        37Px/PctAZCEA96ogc/DNF2XjdUpMSaEybR0LuHHstAqkrq8AyRtDJNYE+09jDFd
        UIukhErLuo1YPWqFABEBAAGJAR8EGAECAAkFAleIdv0CGwwACgkQCCq1a6FP5ZH8
        +wf/erZneDXqM6xYT8qncFpc1GtOCeODNb19Ii22lDEXd9qNUlAz2SB6zC5oywln
        R0o1cglcrW96MD/uuCL/+tTczeB2C455ofs2mhpK7nKiA4FM+JZZ6XSBnq7sfsYD
        6knbvS//SXQV/qYb4bKMvwYnyMz63escgQhOsTT20ptc/w7fC+YPBR/rHImKspyI
        wxyqU8EXylFW8f3Ugi2+Fna3CAPR9yQIAChkCjUawUa2VFmm5KP8DHg6oWM5mdqc
        pvU5DMqpi8SA26DEFvULs8bR+kgDd5AU3I4+ei71GslOdfk4s1soKT4X2UK+dCCX
        ui+/5ZJHakC67t5OgbMas3Hz4Q==
        =HHRW
        -----END PGP PUBLIC KEY BLOCK-----

package_update: true
package_upgrade: false

packages:
  - libnss-resolve
  - iotop
  - iftop
  - ripgrep
  - libxml2-utils
  - libstemmer-dev
  - libstemmer-tools
  - zip
  - unzip
  - p7zip-full
  - apt-transport-https
[...]

Environment details

  • Cloud-init version: 22.4.2
  • Operating System Distribution: Debian 12
  • Cloud provider, platform or installer type: Google Cloud Platform. This happens on vanilla VMs (N2D, 4 vCPU, 4 GB RAM) in a Managed Instance Group.

A potentially relevant piece of information:

  • VMs in a Managed Instance Group get their IP address via DHCP.
  • Sometimes the DHCP negotiation can take a few seconds, during which the network is considered to be "UP" but in reality the OS cannot make requests via HTTP due to the lack of an IP address.
  • This is why we enabled systemd-networkd-wait-online.service at the disk image level, not via cloud-init.
  • Our hypothesis is that apt-get update was executed during that brief phase of DHCP negotiation.
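One way to test this hypothesis is to probe a repository host over TCP right before apt runs and record whether the connection succeeds. The following is only a minimal sketch: `wait_for_tcp` is a hypothetical helper, not part of cloud-init, and its attempt count and delay are arbitrary assumptions.

```python
import socket
import time


def wait_for_tcp(host, port, attempts=5, delay=1.0, timeout=2.0, sleep=time.sleep):
    """Return True once a TCP connection to (host, port) succeeds.

    Hypothetical helper (not a cloud-init API). It tries up to `attempts`
    times, pausing `delay` seconds between failed tries, and returns
    False if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            # create_connection resolves the name and opens the socket, so it
            # fails with OSError if DNS or routing is not ready yet.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            if attempt < attempts:
                sleep(delay)
    return False
```

For example, `wait_for_tcp("packages.sury.org", 443)` could be called from a boot-time script; a `False` result logged with a timestamp close to the failed `apt-get update` would support the DHCP timing theory.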

cloud-init logs

[...]
2024-03-11 11:27:45,830 - helpers.py[DEBUG]: Running update-sources using lock (<FileLock using file '/var/lib/cloud/instances/3291944383708091516/sem/update_sources'>)
2024-03-11 11:27:45,831 - debian.py[DEBUG]: Waiting for apt lock
2024-03-11 11:27:45,832 - debian.py[DEBUG]: apt lock available
2024-03-11 11:27:45,832 - subp.py[DEBUG]: Running command ['eatmydata', 'apt-get', '--option=Dpkg::Options::=--force-confold', '--option=Dpkg::options::=--force-unsafe-io', '--assume-yes', '--quiet', 'update'] with allowed return codes [0] (shell=False, capture=False)
2024-03-11 11:27:49,268 - util.py[DEBUG]: apt-update [eatmydata apt-get --option=Dpkg::Options::=--force-confold --option=Dpkg::options::=--force-unsafe-io --assume-yes --quiet update] took 3.436 seconds
2024-03-11 11:27:49,269 - handlers.py[DEBUG]: finish: modules-config/config-apt-configure: FAIL: running config-apt-configure with frequency once-per-instance
2024-03-11 11:27:49,269 - util.py[WARNING]: Running module apt-configure (<module 'cloudinit.config.cc_apt_configure' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_apt_configure.py'>) failed
2024-03-11 11:27:49,269 - util.py[DEBUG]: Running module apt-configure (<module 'cloudinit.config.cc_apt_configure' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_apt_configure.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/config/modules.py", line 246, in _run_modules
    ran, _r = cc.run(
              ^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 67, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
              ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_apt_configure.py", line 196, in handle
    apply_apt(apt_cfg, cloud, target)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_apt_configure.py", line 247, in apply_apt
    add_apt_sources(
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_apt_configure.py", line 629, in add_apt_sources
    update_packages(cloud)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_apt_configure.py", line 542, in update_packages
    cloud.distro.update_package_sources()
  File "/usr/lib/python3/dist-packages/cloudinit/distros/debian.py", line 293, in update_package_sources
    self._runner.run(
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
              ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/distros/debian.py", line 287, in package_command
    self._wait_for_apt_command(
  File "/usr/lib/python3/dist-packages/cloudinit/distros/debian.py", line 223, in _wait_for_apt_command
    return util.log_time(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2680, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 335, in subp
    raise ProcessExecutionError(
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['eatmydata', 'apt-get', '--option=Dpkg::Options::=--force-confold', '--option=Dpkg::options::=--force-unsafe-io', '--assume-yes', '--quiet', 'update']
Exit code: 100
Reason: -
Stdout: -
Stderr: -
2024-03-11 11:27:49,277 - modules.py[DEBUG]: Running module runcmd (<module 'cloudinit.config.cc_runcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_runcmd.py'>) with frequency once-per-instance
[...]

CC: @Francesco-n-dev

I don't believe enabling systemd-networkd-wait-online is an issue, as we also block on it in cloud-init on Ubuntu before performing apt update or any package management operations.
It's unfortunate that our logs don't show stdout/stderr output for the failed apt-get update command. I wonder if we also get something in /var/log/cloud-init-output.log that adds more information about the failure. I've seen exit 100 for multiple reasons, from invalid apt config files, to temporarily inaccessible APT repos, to proxy errors. So it could be a number of issues.
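Since the log shows the command was run with capture=False, the error text from apt never reaches cloud-init.log. As a rough sketch, the same command can be re-run by hand with output captured so that the reason behind exit 100 becomes visible. `run_captured` is a hypothetical helper, and dropping the `eatmydata` wrapper is an assumption made to keep the example short.

```python
import subprocess

# The arguments cloud-init logged, minus the eatmydata wrapper (assumption:
# the wrapper is not relevant to the failure being diagnosed).
APT_UPDATE = [
    "apt-get",
    "--option=Dpkg::Options::=--force-confold",
    "--option=Dpkg::options::=--force-unsafe-io",
    "--assume-yes",
    "--quiet",
    "update",
]


def run_captured(cmd):
    """Run cmd and return its exit code together with stdout and stderr.

    Hypothetical helper; check=False means a non-zero exit is reported
    in the result instead of raising an exception.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Calling `run_captured(APT_UPDATE)` as root on a failed node and printing the `stderr` field should show apt's own `E:` or `W:` lines, which would help tell a bad source entry apart from an unreachable repository.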

If possible, when a node fails again, can you please run cloud-init collect-logs and attach the tar.gz to this issue? It would be helpful to correlate timing in the journalctl logs against the apt update run and systemd-networkd. Also, if possible, after a failure try running apt update manually on the command line to confirm that it works at a later time on this node (ruling out persistent network issues for the node).

I'm marking this incomplete until we also have the cloud-init.tar.gz logs so we can dig further, as I'm not certain we have enough information at the moment to understand this race. It's possible we could provide a mechanism to retry a couple of times on exit 100 for apt update... But I'm hesitant to put that band-aid in place until we better understand the cause of this symptom.
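For illustration only, the retry idea could look roughly like the sketch below. This is not how cloud-init behaves today, and it is not a proposed patch: `run_with_retry`, its defaults, and the linear backoff are all assumptions.

```python
import subprocess
import time


def run_with_retry(cmd, retries=3, delay=5.0, retry_codes=(100,), sleep=time.sleep):
    """Run cmd, retrying only while it exits with a code in retry_codes.

    Hypothetical helper, not a cloud-init API. Returns the last
    CompletedProcess, so the caller can still inspect the final exit
    code and the captured stderr.
    """
    result = None
    for attempt in range(1, retries + 1):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        if result.returncode not in retry_codes:
            # Success, or a failure that a retry is not expected to fix.
            return result
        if attempt < retries:
            # Linear backoff: wait delay, then 2 * delay, and so on.
            sleep(delay * attempt)
    return result
```

A workaround on the reporter's side could call `run_with_retry(["apt-get", "--assume-yes", "--quiet", "update"])` from a boot script. As noted above, this would only hide the symptom, so the captured stderr of each failed attempt should still be logged.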

Hi Chad,

thanks a lot for your reply.

when a node fails again, can you please run cloud-init collect-logs and attach the tar.gz to this issue?

Sure thing. I did not know about the collect-logs command.

It's possible we could provide a mechanism to retry a couple of times on exit 100 for apt update... But I'm hesitant to put that band-aid in place until we better understand the cause of this symptom.

100% agree.

This is why we enabled systemd-networkd-wait-online.service at the disk image level, not via cloud-init.

@corradofiore are you saying that these Debian 12 cloud images don't already have this enabled?