coreos / ignition

First boot installer and configuration tool

Home Page: https://coreos.github.io/ignition/

CloudStack provider: conflicting fetch between HTTP and ConfigDrive userdata

mlsorensen opened this issue

Bug

When a user sets up a CloudStack network such that DHCP is provided by Virtual Router (so a VR exists on the network), but UserData is provided by ConfigDrive, Ignition's CloudStack provider accepts a 404 from an HTTP UserData request to the VR as empty UserData and ignores the ConfigDrive's userdata.

Operating System Version

3374.2.5

Ignition Version

2.14.0

Environment

CloudStack Network, VR for DHCP provider and ConfigDrive for UserData provider

Expected Behavior

Expected the CloudStack Ignition provider to treat a 404 or empty userdata from the VR as no userdata and continue on to try the config-2 labeled configdrive. I'm following up on this on the CloudStack side as well, but since the 404 also confuses the Ignition side, I figured it should probably be addressed here too.

Actual Behavior

HTTP userdata, whether empty or a 404, was accepted as the userdata for the system, and the configdrive was ignored.

Reproduction Steps

I think this can actually be reproduced outside of CloudStack if:

  1. you create a configdrive labeled config-2 that contains a userdata file at /cloudstack/userdata/user_data.txt and attach it to the VM, and
  2. your DHCP server for the VM also has an http service running on port 80. It doesn't need to host any userdata.

Other Information

Attaching screenshots of the console for both the empty userdata (GET result: OK) from HTTP and the 404 (GET result: Not Found). You can see that in both cases the response is parsed as valid but empty userdata (per the SHA cf83e...), which causes the real configdrive userdata to be ignored.
[Screenshot: ignition — empty HTTP userdata, GET result: OK]
[Screenshot: ignition-404 — HTTP userdata, GET result: Not Found]
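
For what it's worth, the cf83e... digest in those logs is just the SHA-512 of zero bytes, which is why the empty 200 response and the 404 both show up as the same "valid but empty" config. A quick illustration in Go (not Ignition code):

    package main

    import (
        "crypto/sha512"
        "fmt"
    )

    func main() {
        // SHA-512 of an empty byte slice; the output begins with cf83e135...,
        // matching the hash logged for both the empty and the 404 userdata.
        sum := sha512.Sum512([]byte{})
        fmt.Printf("%x\n", sum[:])
    }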

The real userdata is there on configdrive:
[Screenshot: configdrive — userdata present on the config-2 drive]

Once the HTTP server on the VR is shut down, Ignition logs a GET error and reads the data from the config-2 drive.
[Screenshot: working-configdrive — GET error from HTTP, userdata read from config-2]

Thanks for reporting this. The current code retries indefinitely until it either obtains a config, or positive confirmation of no config, from the config drive or metadata service. If we ignored a legitimate 404 from the metadata service, and the config drive never showed up, we'd end up blocking boot indefinitely. So we'll need a way to distinguish between the no-userdata-provided case and the try-the-configdrive-instead case. Do you know if the metadata service provides a way to do that?

Hi @bgilbert - What I can say is that with CloudStack, both the metadata HTTP server and the config drive are set up prior to system boot in a preparation stage; they aren't operated on in parallel with boot, and the config ISO is not hot-plugged. If we get a 404, or don't find a config drive, it isn't going to show up later.

As far as blocking indefinitely if no userdata was provided to the VM - I think maybe that is a risk regardless, as it's possible to create a VM on a network that does not provide userdata services at all. However, if a network's userdata provider is ConfigDrive, barring a bug in the VM orchestration there will always be a config drive. It will still contain a cloudstack/metadata directory with metadata files, but it will not contain a cloudstack/userdata directory. If a network's userdata provider is VirtualRouter, there will always be a fetchable userdata file, even if it is empty. Additionally there is metadata such as http://{router-ip}/latest/meta-data/instance-id regardless of whether or not userdata was provided.
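
For illustration only (this is not Ignition code; the function name and the 10.1.1.1 address are placeholders): probing that always-present metadata path is enough to tell a real CloudStack metadata service apart from some unrelated HTTP server listening on port 80 of the DHCP address:

    package main

    import (
        "fmt"
        "net/http"
    )

    // metadataServiceValid reports whether the server at routerIP answers
    // the always-present CloudStack metadata path with 200 OK.
    func metadataServiceValid(routerIP string) (bool, error) {
        resp, err := http.Get("http://" + routerIP + "/latest/meta-data/instance-id")
        if err != nil {
            return false, err
        }
        defer resp.Body.Close()
        return resp.StatusCode == http.StatusOK, nil
    }

    func main() {
        ok, err := metadataServiceValid("10.1.1.1") // placeholder VR address
        fmt.Println("metadata service present:", ok, "err:", err)
    }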

I'm willing to help develop this and test it out; however, I'm having trouble finding a developer guide that will hold my hand enough to get going. I guess if I check out the source code into a VM and build/install it locally there, I can set up userdata scenarios and then perhaps trigger Ignition somehow.

Thanks for the help.

It seems like the existing detection logic for the config drive should be fine, then: if we find a volume with the correct label, we mount it, and it either does or doesn't contain userdata. The problem is with the metadata service, where we need to distinguish between an existent metadata service and a random HTTP server running on the VR. And I think you have the right idea: we should check for the existence of some appropriate cloudstack/metadata item, and if missing, assume the metadata service is invalid rather than treating the cloudstack/userdata 404 as canonical.
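
For concreteness, here is a rough sketch of that decision flow. The helper names (metadataServiceValid, fetchUserdataConfigDrive, fetchUserdataHTTP) are hypothetical stubs, not the actual provider functions, and the real code is organized differently; this only illustrates the proposed ordering:

    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical stubs standing in for the real provider logic.
    func metadataServiceValid(routerIP string) bool         { return false }
    func fetchUserdataConfigDrive() ([]byte, bool, error)   { return nil, false, nil }
    func fetchUserdataHTTP(routerIP string) ([]byte, error) { return nil, nil }

    func fetchUserdata(routerIP string) ([]byte, error) {
        for {
            // A config-2 volume, if present, is authoritative: it either
            // carries cloudstack/userdata or it definitively does not.
            if data, found, err := fetchUserdataConfigDrive(); err != nil {
                return nil, err
            } else if found {
                return data, nil
            }

            // Only accept the HTTP answer (including a 404 meaning "no
            // userdata") if the server is a real metadata service.
            if metadataServiceValid(routerIP) {
                return fetchUserdataHTTP(routerIP)
            }

            // Otherwise keep retrying: device enumeration may be slow,
            // and misprovisioning on a guess must be avoided.
            time.Sleep(5 * time.Second)
        }
    }

    func main() {
        // With the stubs above this loops forever, mirroring the
        // retry-until-resolved behavior described in this thread.
        data, err := fetchUserdata("10.1.1.1") // placeholder VR address
        fmt.Println(len(data), err)
    }

The config-drive and HTTP fetch paths already exist in the provider; the new piece would just be the metadata-service validity check.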

The CloudStack provider isn't actively maintained (no test environment) so if you're able to implement this yourself, we'd happily accept a PR. We don't really have a developer guide, but basically:

  • The relevant function is here.
  • ./build builds Ignition and ./test runs unit tests (of which there are none for cloud providers).
  • In this case, you don't need to build a new OS image with your modified Ignition, since you're just testing config fetch. You should be able to just sftp a new Ignition binary to a CloudStack instance (running the same distro version as your build machine). Then:
    sudo rm -f fetched.ign && \
    sudo ./ignition -config-cache fetched.ign -log-to-stdout -platform cloudstack -stage fetch
  • Ask here if you have any questions!

For completeness, re the other parts of your comment:

The config ISO may not be hotplugged, but in general we can't/don't assume the kernel will finish enumerating storage devices in any particular amount of time. Enumeration can be slow on large/heavily loaded systems, so Ignition generally keeps retrying, rather than using a timeout and risking misprovisioning if the timeout is too aggressive.

If the VM has neither a metadata service nor a config drive, I'd say blocking indefinitely is reasonable behavior. Ignition requires that some metadata service exists; the instance isn't going to be useful without one.

In the HyperShift KubeVirt provider we are hitting a similar issue: there we use the Ignition openstack provider on Azure, and the fetch from the metadata server returns a 404, so config drives are not read.

Discussed offline with @qinqon. AFAICT the solution for #1574 (comment) is that HyperShift should use the kubevirt provider in KubeVirt instead of the openstack provider.

We have tested an image with platform.id=kubevirt and it is working fine, but we need to wait for coreos/fedora-coreos-tracker#1126 to consume official artifacts.

Thanks for the additional info, @bgilbert @qinqon, I'll take a look. Since it was mentioned that there is no active maintenance on this part and no environment or existing tests, I assume there is also no test code I should be adding for such a change?

Correct, there isn't. For providers that aren't tested via OS-level end-to-end tests, we're entirely dependent on manual testing.