GoogleCloudPlatform / guest-agent

Google Guest Agent hangs when starting with no IP on the main interface

shneor521 opened this issue

Description

When the google-guest-agent service starts while the primary interface has no IP, the service hangs.
Its status keeps alternating between activating and deactivating.

This also causes other networking commands, such as systemctl start systemd-networkd and netplan apply, to hang.

google-guest-agent.service - Google Compute Engine Guest Agent
   Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
   Active: deactivating (stop-sigterm) (Result: timeout) since Wed 2022-08-24 16:54:26 UTC; 23min ago
 Main PID: 25708 (google_guest_ag)
    Tasks: 10 (limit: 4660)
   CGroup: /system.slice/google-guest-agent.service
           └─25708 /usr/bin/google_guest_agent

Aug 24 17:15:58 fresh systemd[1]: Starting Google Compute Engine Guest Agent...
Aug 24 17:15:58 fresh google_guest_agent[25708]: GCE Agent Started (version 20220622.00-0ubuntu2~18.04.0)
Aug 24 17:17:28 fresh systemd[1]: google-guest-agent.service: Start operation timed out. Terminating.
Aug 24 17:17:43 fresh google_guest_agent[25708]: CRITICAL main.go:298 error registering service: failed to shutdown within timeout 15s
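
To confirm that the unit really is flapping rather than catching it in a single transition, one could sample its state a few times. This is only a sketch: the function names is_transitional_state and count_transitional_samples are hypothetical, and it assumes a systemd new enough to support systemctl show --value.

```shell
#!/bin/sh
# Succeeds when STATE is one of systemd's transitional ActiveState values.
is_transitional_state() {
    case "$1" in
        activating|deactivating) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: sample the unit's ActiveState COUNT times, one second
# apart, and print how many samples were transitional.
count_transitional_samples() {
    unit="${1:-google-guest-agent.service}"
    count="${2:-10}"
    hits=0
    i=0
    while [ "$i" -lt "$count" ]; do
        state="$(systemctl show -p ActiveState --value "$unit")"
        if is_transitional_state "$state"; then
            hits=$((hits + 1))
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "$hits"
}
```

Calling count_transitional_samples google-guest-agent.service 10 and getting a result close to 10 would suggest the activating/deactivating loop described above.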

Setup

I'm using a GCP VM with an Ubuntu 18.04 image.
The image ID (sourceImage) is projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220810.

One primary interface. No special configurations.

Steps to reproduce

systemctl stop google-guest-agent
ifconfig ens4 0
systemctl start google-guest-agent

The last command hangs and never finishes, and no IP is configured on the interface.
In principle, nothing should prevent the interface from acquiring a new IP.
When this happens, netplan apply finishes but the interface still does not get an IP.
systemctl restart systemd-networkd does allow recovery: the agent fails to deactivate, but the subsequent activation succeeds.
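
As a workaround sketch for the reproduction above, a wrapper could refuse to start the agent while the interface has no IPv4 address. The function names has_inet_line and start_agent_if_addressed are hypothetical, and ens4 is simply the interface name from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the text on stdin (expected to be the
# output of `ip -o -4 addr show dev IFACE`) lists at least one IPv4 address.
has_inet_line() {
    grep -q ' inet '
}

# Hypothetical wrapper: start the agent only when IFACE already has an
# IPv4 address; otherwise print a message and return non-zero.
start_agent_if_addressed() {
    iface="${1:-ens4}"
    if ip -o -4 addr show dev "$iface" | has_inet_line; then
        systemctl start google-guest-agent
    else
        echo "no IPv4 address on $iface; not starting google-guest-agent" >&2
        return 1
    fi
}
```

This does not change the agent's behavior; it only avoids issuing a start request that is known to hang.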

Note

This commit made the hang occur more frequently, since starting systemd-networkd now causes google-guest-agent to start.
So if systemd-networkd starts successfully but the main interface does not get an IP (for whatever reason), the start of google-guest-agent will hang.

This is by design - the agent hangs until network is available. The guest agent is not responsible for configuring the primary interface.

I understand what you are saying, but as explained in the description, other network commands also hang or don't work as expected in this condition.

For example, a user who wants to obtain a new IP assumes they can run netplan apply.
When google-guest-agent is in the activating/deactivating loop, netplan apply does nothing, the interface doesn't get an IP, and the system remains non-operational.

If a user runs systemctl stop systemd-networkd and systemctl start systemd-networkd, it doesn't work either.

If a user runs systemctl restart systemd-networkd, it recovers.
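
The recovery path described here could be scripted roughly as follows. This is a sketch under assumptions: wait_until, iface_has_ipv4 and recover_network are hypothetical names, the 30-second wait is arbitrary, and the only recovery command used is the systemctl restart systemd-networkd reported to work above.

```shell
#!/bin/sh
# Generic retry helper: run CMD up to TRIES times, sleeping DELAY seconds
# between attempts; succeeds as soon as CMD succeeds.
wait_until() {
    tries="$1"
    delay="$2"
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Succeeds if the named interface currently has an IPv4 address.
iface_has_ipv4() {
    ip -o -4 addr show dev "$1" 2>/dev/null | grep -q ' inet '
}

# Hypothetical recovery routine: restart systemd-networkd, then wait up to
# about 30 seconds for the interface to obtain an address.
recover_network() {
    iface="${1:-ens4}"
    systemctl restart systemd-networkd || return 1
    wait_until 30 1 iface_has_ipv4 "$iface"
}
```

Running recover_network ens4 as root would return non-zero if the interface still has no address after the wait, which makes the failure visible instead of leaving the system silently without networking.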

netplan apply is the standard Ubuntu way to set up the network and obtain a new IP, so when google-guest-agent is in this loop, it prevents the system's networking from operating normally.

It looks like "google-guest-agent" has a collateral impact on the normal networking operation of the instance.
It's possible to recover from this state, but the user experience is impacted, and finding the exact way to recover is not straightforward for a normal user.

So it's possible to claim that google-guest-agent has no issue and that the issue is, for example, in netplan apply, but the fact is that any other Ubuntu system works correctly. Only Google Cloud instances suffer from this issue, because of this collateral impact.

The guest agent is a critically required service for networking when installed, but it explicitly assumes that the network has already been started. Otherwise, it will hang indefinitely until networking is available. If you want to use a custom network setup, you can disable or uninstall the guest agent (or the guest environment as a whole). We have no plans to make the guest agent accept a no-network situation, since it is intended by design to work only when networking is up.