GoogleCloudPlatform / guest-agent

Setting MetadataScripts startup to false may cause the google-startup-scripts service to enter a failed state

action opened this issue · comments

commented

Problem:
After disabling startup scripts in the instance config, the google-startup-scripts service may enter a failed state after rebooting the associated VM.

Expectation:
The google-startup-scripts service should not enter a failed state after booting, because startup scripts are disabled in the instance config.


Snippet of detected failure:

$ systemctl --failed --all
  UNIT                           LOAD   ACTIVE SUB    DESCRIPTION
● google-startup-scripts.service loaded failed failed Google Compute Engine Startup Scripts

Take a look at the service's status:

$ sudo systemctl status google-startup-scripts.service
● google-startup-scripts.service - Google Compute Engine Startup Scripts
   Loaded: loaded (/lib/systemd/system/google-startup-scripts.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2021-04-19 18:32:44 UTC; 3min 6s ago
  Process: 3016 ExecStart=/usr/bin/google_metadata_script_runner startup (code=exited, status=2)
 Main PID: 3016 (code=exited, status=2)

Apr 19 18:32:44 aa-qa-6080-gcp0 systemd[1]: Starting Google Compute Engine Startup Scripts...
Apr 19 18:32:44 aa-qa-6080-gcp0 google_metadata_script_runner[3016]: startup scripts disabled in instance config
Apr 19 18:32:44 aa-qa-6080-gcp0 systemd[1]: google-startup-scripts.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 19 18:32:44 aa-qa-6080-gcp0 systemd[1]: google-startup-scripts.service: Failed with result 'exit-code'.
Apr 19 18:32:44 aa-qa-6080-gcp0 systemd[1]: Failed to start Google Compute Engine Startup Scripts.

Inspect contents of /etc/default/instance_configs.cfg:

$ cat /etc/default/instance_configs.cfg | tail -8
#
# Disable user supplied startup/shutdown scripts from running on
# the engine.
#
[MetadataScripts]
shutdown = false
startup = false
# END ANSIBLE MANAGED BLOCK

The service logged that it failed with result 'exit-code'. Running the script runner directly shows the exit status:

$ sudo google_metadata_script_runner startup
startup scripts disabled in instance config

$ echo $?
2

Details of the google-guest-agent package:

$ dpkg-query --status google-guest-agent
Package: google-guest-agent
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 23901
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Version: 20201217.02-0ubuntu1~18.04.0
Replaces: gce-compute-image-packages (<< 20191115)
Depends: libc6 (>= 2.4)
Breaks: gce-compute-image-packages (<< 20191115), python3-google-compute-engine
Description: Google Compute Engine Guest Agent
 Contains the guest agent and metadata script runner binaries.
Built-Using: golang-1.13 (= 1.13.8-1ubuntu1~18.04.2)
Homepage: https://github.com/GoogleCloudPlatform/guest-agent

Please let me know if there is any additional information I can provide that will be helpful to reproduce, diagnose, or address the issue.

Our expectation was that by following the instructions (found here: https://github.com/GoogleCloudPlatform/guest-agent#configuration) to disable startup scripts, the associated services would continue to execute gracefully. It was an unexpected result to find the google-startup-scripts service in a failed state.

This is by design. Please note the output message when you invoke the startup script runner: "startup scripts disabled in instance config".

Can you share the impact of having this service in ActiveState=failed?

commented

Thank you for the quick response and explaining that this might be expected behavior.

My team finds it a little odd for a service to be in a failed state and for a failed state to be an expected state for a service. Our general expectation is that a failed service means that attention is required. We expect that no services should be in a failed state on our system when things are configured successfully and operating as expected.

The impact of having a service in a failed state is that our process for certifying that our product runs successfully on GCP reports an error.

Yes, I understand it is not intuitive. With systemd, not every service is necessarily expected to succeed. Where the condition features for systemd units are insufficient, many units simply allow their services to fail. Systems administrators are expected to define their own conditions for which services must succeed and what to do in case of failures.
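As a rough illustration of "defining your own conditions", a certification check could filter the list of failed units against an allow-list instead of treating every failed unit as an error. The sketch below is only an assumption about how such a check might look: the `ALLOWED_FAILED` variable and the `unexpected_failures` function are hypothetical names, not part of the guest agent or of systemd.

```shell
#!/bin/sh
# Hypothetical allow-list: space-separated units whose failure is expected.
ALLOWED_FAILED="google-startup-scripts.service"

# Reads lines in the format printed by
#   systemctl list-units --state=failed --plain --no-legend
# on stdin and prints the names of failed units that are NOT allow-listed.
unexpected_failures() {
  while read -r unit rest; do
    # Skip the status bullet that systemctl prints when --plain is not used.
    if [ "$unit" = "●" ]; then
      unit=${rest%% *}
    fi
    [ -n "$unit" ] || continue
    case " $ALLOWED_FAILED " in
      *" $unit "*) ;;                 # expected failure, ignore
      *) printf '%s\n' "$unit" ;;     # unexpected failure, report
    esac
  done
}
```

A check built this way could run `systemctl list-units --state=failed --plain --no-legend | unexpected_failures` and fail only when the output is non-empty.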

We will consider changing the exit behavior in this scenario, but it takes time to roll out such changes. In the meantime, when you add the instance config entry, you can also disable the startup scripts service, which should resolve the issue for you immediately.
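The suggested workaround could be scripted roughly as follows. This is a sketch, not an official procedure: the `run` and `disable_startup_unit` helpers and the `APPLY` variable are hypothetical, and by default the script only prints the commands it would run so that it can be reviewed first.

```shell
#!/bin/sh
UNIT="google-startup-scripts.service"

# Print the command by default; set APPLY=1 to actually run it (requires root).
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

disable_startup_unit() {
  # Stop systemd from starting the unit on subsequent boots.
  run systemctl disable "$UNIT"
  # Clear the failed state left over from the current boot.
  run systemctl reset-failed "$UNIT"
}
```

Calling `disable_startup_unit` prints the two `systemctl` commands; running the script as root with `APPLY=1` set executes them. Because the instance config in this report is managed by Ansible, the same two steps would presumably belong in that playbook alongside the `[MetadataScripts]` block.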

commented

Thank you for addressing my questions and concerns. Your feedback was helpful and informative. Appreciate you.

We look forward to the exit behavior changing for this scenario. We understand that changes can take time, and we will keep an eye on this issue to stay informed of any updates.

In the meantime, I will have my team look into how we can mitigate the issue on our end, including, as you suggested, disabling the google-startup-scripts service.
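One further mitigation, which was not proposed in this thread and is offered only as an assumption-laden sketch, is a systemd drop-in that tells systemd to treat exit status 2 as success via the `SuccessExitStatus=` directive. The `write_dropin` helper and the file name `10-exit-status.conf` are hypothetical. Note the trade-off: if the script runner also exits with status 2 for genuine errors, this drop-in would hide them.

```shell
#!/bin/sh
UNIT="google-startup-scripts.service"

# Hypothetical helper: writes a drop-in into the directory given as $1.
# On a real host that directory would be /etc/systemd/system/$UNIT.d
# and writing to it requires root.
write_dropin() {
  dir="$1"
  mkdir -p "$dir" || return 1
  cat > "$dir/10-exit-status.conf" <<'EOF'
[Service]
# Treat "startup scripts disabled in instance config" (exit status 2) as success.
SuccessExitStatus=2
EOF
}
```

After writing the drop-in to the real location, `systemctl daemon-reload` is needed for systemd to pick it up.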