canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization

Home Page: https://cloud-init.io/

cloud-init upgrade causes vultr init networking to fail.

pnearing opened this issue

On upgrading an Ubuntu Mantic server on Vultr I started getting an error on boot:

2024-03-23 15:02:27,710 - url_helper.py[WARNING]: Calling 'None' failed [119/120s]: request error [HTTPConnectionPool(host='fd00:ec2::254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7aefa0738690>: Failed to establish a new connection: [Errno 101] Network is unreachable'))]

And now this message appears on login:


This system is using the EC2 Metadata Service, but does not appear to
be running on Amazon EC2 or one of cloud-init's known platforms that
provide a EC2 Metadata service. In the future, cloud-init may stop
reading metadata from the EC2 Metadata Service unless the platform can
be identified.

If you are seeing this message, please file a bug against
cloud-init at
https://github.com/canonical/cloud-init/issues
Make sure to include the cloud provider your instance is
running on.

For more information see
#2795

After you have filed a bug, you can disable this warning by
launching your instance with the cloud-config below, or
putting that content into
/etc/cloud/cloud.cfg.d/99-ec2-datasource.cfg

#cloud-config
datasource:
  Ec2:
    strict_id: false


Disable the warnings above by:
touch /root/.cloud-warnings.skip
or
touch /var/lib/cloud/instance/warnings/.skip

If there is any more information you might require, please let me know.

Can you please run cloud-init collect-logs on the system and attach the resulting .tgz to this bug to give us a bit more information? Also, are there any other files besides /etc/cloud/cloud.cfg at play here?

Generally speaking, we'd expect the Vultr datasource to be discovered here if we are using the latest cloud-init on Mantic, so I'm presuming there is an issue earlier in the logs that led to Ec2 being detected instead of Vultr. The cloud-init collect-logs requested above will hopefully give us all the information we need to discover how this instance managed to not detect Vultr and fall back to Ec2. The logs of most interest here (which will be included in that tar file) are /run/cloud-init/ds-identify.log (initial datasource detection) and /var/log/cloud-init.log (which will potentially show us why Vultr wasn't detected). cloud-init status --format=json may also show you known errors quickly.
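
For convenience, the commands referenced above can be run directly on the affected instance (the exact name and location of the collect-logs tarball may vary slightly by cloud-init version):

# Gather logs into a tarball to attach to this issue
cloud-init collect-logs

# Quick summary of any errors cloud-init already knows about
cloud-init status --format=json

# Initial datasource detection breadcrumbs
cat /run/cloud-init/ds-identify.log

# Full log, which should show why Vultr was (or wasn't) detected
less /var/log/cloud-init.log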

Yes, something else is going on here besides just the upgrade path. I launched a Vultr Mantic 23.10 instance with cloud-init 23.3.3, upgraded to the latest cloud-init 23.4.4, and rebooted with no issues in Vultr datasource detection.
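
For anyone reproducing this, the upgrade path was roughly the following (the exact apt invocation is an assumption; any route to the 23.4.4-0ubuntu0~23.10.1 package should behave the same):

# Vultr Mantic 23.10 instance, cloud-init 23.3.3 preinstalled
apt-get update
apt-get install --only-upgrade cloud-init   # to 23.4.4-0ubuntu0~23.10.1
reboot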

I did notice a small known bug emitting a warning about scripts/vendor, the fix for which has already landed in #4986, but that issue would not have caused the Vultr datasource to go undiscovered.

root@test-mantic:~# cloud-init --version
/usr/bin/cloud-init 23.4.4-0ubuntu0~23.10.1
root@test-mantic:~# cloud-id
vultr

CC: @eb3095 just FYI as I don't see a problem at the moment, but we'll wait on logs.

Please find the logs attached. As well, I've not changed any config in /etc/cloud.
cloud-init.tar.gz

Thanks a lot for the logs @pnearing. As near as I can tell, something between 03/09 and the reboot on 03/23 somehow altered the list of datasources that cloud-init tries to discover on this system from [ Vultr, None ] to the full list of all potential datasources:
2024-03-23 13:31:15,412 - __init__.py[DEBUG]: Looking for data source in: ['NoCloud', 'ConfigDrive', 'OpenNebula', 'DigitalOcean', 'Azure', 'AltCloud', 'OVF', 'MAAS', 'GCE', 'OpenStack', 'CloudSigma', 'SmartOS', 'Bigstep', 'Scaleway', 'AliYun', 'Ec2', 'CloudStack', 'Hetzner', 'IBMCloud', 'Oracle', 'Exoscale', 'RbxCloud', 'UpCloud', 'VMware', 'Vultr', 'LXD', 'NWCS', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM']

Normally /usr/lib/cloud-init/ds-identify would filter this list of datasources down to only what could be viable, but there is configuration on this Vultr instance setting manual_cache_clean: true, which prevents ds-identify from filtering the list in the systemd generator timeframe. You can see that breadcrumb in /run/cloud-init/ds-identify.log: "manual_cache_clean enabled. Not writing datasource_list." This prevents ds-identify from writing out /run/cloud-init/cloud.cfg with a limited datasource_list: [ Vultr, None ]. Therefore, in the latest cloud-init logs you now see cloud-init spend a long time trying to detect a whole bunch of inapplicable datasources, plus that lovely error banner telling you to file a bug.
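
On an affected instance, each piece of that behavior can be confirmed directly using the files named above:

# The breadcrumb ds-identify leaves when it declines to narrow the list
grep -i manual_cache_clean /run/cloud-init/ds-identify.log

# Normally ds-identify writes a narrowed datasource_list here; with
# manual_cache_clean enabled, no narrowed list is written
cat /run/cloud-init/cloud.cfg

# The marker present when manual_cache_clean is in effect
ls -l /var/lib/cloud/instance/manual-clean

# Where the datasource_list and manual_cache_clean settings come from
grep -r -E 'datasource_list|manual_cache_clean' /etc/cloud/cloud.cfg /etc/cloud/cloud.cfg.d/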

Normally I would expect to see /etc/cloud/cloud.cfg with a limited datasource_list such as datasource_list: ['Vultr']. Did something change in /etc/cloud or the /etc/cloud/cloud.cfg.d/*cfg files to change this default setting? This ultimately is likely the problem we are seeing here.

Additionally, the file /var/lib/cloud/instance/manual-clean exists on this machine, meaning manual_cache_clean: true was set in configuration or user-data. When that manual-clean marker exists, ds-identify does not narrow the datasource_list.

> Did something change in /etc/cloud or /etc/cloud/cloud.cfg.d/*cfg files to change this default setting?

No config change needed. The Python version changed (presumably on upgrade), causing the cache to clear.

Yeah, this sounds like an extension of one of the issues I was dealing with in IRC. manual_cache_clean: true was added (I had mentioned we were doing this) because we were seeing issues where, if something broke with our host networking or DHCP failed, the server would re-init as NoCloud or whatever the default fallback was, cycling all the keys and the root user. That was a terrible UX for our users, and it caused the server to re-init yet again when networking came back. That was the only solution we were able to find to prevent this behavior, but we didn't make it the default option; we change it in the cloud.cfg we provide in our images.

I'd be happy to reopen this issue and find a more amicable solution so that we do not need to do that.

@eb3095 I did see the provided /etc/cloud/cloud.cfg in your images, which does limit datasource_list: [ Vultr, None ]. If you are generating images and packaging files delivered to /etc/cloud for configuration, you may want to write the datasource_list configuration to a file like /etc/cloud/cloud.cfg.d/95-ds-vultr.cfg containing:

datasource_list: [ Vultr, None ]

The reason is that cloud-init upstream (and dpkg-reconfigure cloud-init) will write /etc/cloud/cloud.cfg.d/90_dpkg.cfg, which overrides the default datasource_list with the potentially long list of all datasources we see above, causing tracebacks and errors, because the Ec2 datasource will get detected on Vultr platforms if Ec2 comes before Vultr in the datasource_list. Whatever /etc/cloud/cloud.cfg.d file you choose, it will need to sort lexicographically later than 90_dpkg.
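
A minimal sketch of what that could look like in an image build (95-ds-vultr.cfg is just the example name suggested above; any name that sorts after 90_dpkg works):

# Ship the narrowed datasource_list as a drop-in that sorts after 90_dpkg.cfg
cat > /etc/cloud/cloud.cfg.d/95-ds-vultr.cfg <<'EOF'
datasource_list: [ Vultr, None ]
EOF

# Confirm the drop-in sorts after 90_dpkg.cfg
ls /etc/cloud/cloud.cfg.d/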

Fantastic, I will get right on that. Thanks for the advice.