GoogleCloudPlatform / guest-agent

google-guest-agent.service goes dead (inactive) when the VM is built with Packer (image) and created with MIGs.

lborguetti opened this issue

Environment

OS: Ubuntu 20.04 LTS
Kernel: 5.11.0-1020-gcp #22~20.04.1-Ubuntu SMP Tue Sep 21 10:54:26 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
SystemD version: systemd 245 (245.4-4ubuntu3.13)
Google Guest Agent version: 20210629.00-0ubuntu1~20.04.0

Problem

We use Packer to create images and launch them in MIGs using templates. After Oct 04, 2021 we noticed that images built by Packer and launched by MIGs do not start the google-guest-agent service, which prevents us from using OS Login to connect to the virtual machines. With an image created on Sep 17, 2021, this behavior does not occur.

The behavior only occurs on the first startup by the MIG. If an affected MIG virtual machine is manually restarted (shutdown -r now), the google-guest-agent service is activated and it is possible to connect to the virtual machine using OS Login after the next boot.

Debugging details from trying to find the root cause

The image provisioning process by Packer uses Ansible and follows this order:

  • Packer creates a VM from Ubuntu 20.04 LTS (ubuntu-os-cloud/ubuntu-minimal-2004-lts);
  • Ansible applies the OS update and restarts the VM, if necessary;
  • Ansible waits for the VM to become available, if necessary;
  • Ansible provisions other services (like nginx);
  • Ansible shuts down the VM and waits for it to be shut down;
  • Packer creates the image that will be used by the MIG templates.

The unit google-guest-agent.service goes into the dead (inactive) state after the first reboot during the Packer/Ansible build process and on the first boot by the MIG.
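A quick way to confirm this symptom on a freshly created instance is to look at both the unit state and whether the unit logged anything during the current boot. The following is only a sketch: the function names are made up for illustration, and it assumes the journal is persistent, so that entries from the image build are still present and `-b` is needed to limit the output to the current boot.

```shell
#!/bin/sh
# Sketch only: function names below are hypothetical helpers, not part of
# the guest agent or of systemd.
UNIT="google-guest-agent.service"

# Print the unit state reported by systemd (for example "active" or "inactive").
# "systemctl is-active" exits non-zero when the unit is not active, so the
# exit status is ignored here and only the printed state is used.
agent_state() {
    systemctl is-active "$UNIT" 2>/dev/null || true
}

# Count journal lines the unit produced during the current boot only (-b).
# -q suppresses informational lines so an empty journal counts as 0.
agent_log_lines_this_boot() {
    journalctl -b -q --no-pager -u "$UNIT" 2>/dev/null | wc -l | tr -d ' '
}

# One-line summary combining both checks.
report_agent() {
    echo "state=$(agent_state) log_lines_this_boot=$(agent_log_lines_this_boot)"
}
```

Calling `report_agent` on an affected instance would be expected to show an inactive state together with a zero line count for the current boot, which matches the logs that follow.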

Logs before the virtual machine created by MIG is restarted

systemctl status google-guest-agent.service

root@xxx:~# systemctl status google-guest-agent
● google-guest-agent.service - Google Compute Engine Guest Agent
     Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

systemd-analyze verify google-guest-agent.service

root@xxx:~# systemd-analyze verify google-guest-agent.service
snap-snapd-13170.mount: Unit is bound to inactive unit dev-loop1.device. Stopping, too.

systemd-analyze critical-chain google-guest-agent.service

root@xxx:~# systemd-analyze critical-chain google-guest-agent.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

└─rsyslog.service @7.388s +111ms
  └─basic.target @7.169s
    └─sockets.target @7.157s
      └─snapd.socket @6.746s +268ms
        └─sysinit.target @6.366s
          └─cloud-init.service @4.478s +1.848s
            └─systemd-networkd-wait-online.service @3.374s +1.093s
              └─systemd-networkd.service @5.479s +65ms
                └─network-pre.target @3.293s
                  └─cloud-init-local.service @2.022s +1.258s
                    └─systemd-udev-trigger.service @858ms +202ms
                      └─systemd-udevd-kernel.socket @761ms
                        └─system.slice @505ms
                          └─-.slice @505ms

google-guest-agent.service logs while the packer/ansible build process is running

note: after the VM is created by the MIG, there are no more log entries for google-guest-agent.service until the service or the VM is manually restarted.

root@xxx:~# journalctl -xe --no-pager -u google-guest-agent.service
-- Logs begin at Wed 2021-10-06 20:51:59 UTC, end at Wed 2021-10-06 21:08:56 UTC. --
Oct 06 20:52:08 xxx systemd[1]: Started Google Compute Engine Guest Agent.
-- Subject: A start job for unit google-guest-agent.service has finished successfully
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit google-guest-agent.service has finished successfully.
--
-- The job identifier is 111.
Oct 06 20:52:08 xxx GCEGuestAgent[572]: 2021-10-06T20:52:08.8406Z GCEGuestAgent Info: GCE Agent Started (version 20210414.00-0ubuntu1~20.04.0)
Oct 06 20:52:09 xxx GCEGuestAgent[572]: 2021-10-06T20:52:09.1992Z GCEGuestAgent Info: Instance ID changed, running first-boot actions
Oct 06 20:52:09 xxx dhclient[662]: Internet Systems Consortium DHCP Client 4.4.1
Oct 06 20:52:09 xxx dhclient[662]: Copyright 2004-2018 Internet Systems Consortium.
Oct 06 20:52:09 xxx dhclient[662]: All rights reserved.
Oct 06 20:52:09 xxx dhclient[662]: For info, please visit https://www.isc.org/software/dhcp/
Oct 06 20:52:09 xxx dhclient[662]:
Oct 06 20:52:09 xxx dhclient[662]: Listening on Socket/ens4
Oct 06 20:52:09 xxx dhclient[662]: Sending on   Socket/ens4
Oct 06 20:52:09 xxx dhclient[662]: Created duid "\000\001\000\001(\360\310\371B\001\012\335\012\016".
Oct 06 20:52:09 xxx google_guest_agent[572]: 2021/10/06 20:52:09 logging client: rpc error: code = PermissionDenied desc = Request had insufficient authentication scopes.
Oct 06 20:52:10 xxx groupadd[658]: group added to /etc/group: name=google-sudoers, GID=1001
Oct 06 20:52:10 xxx groupadd[658]: group added to /etc/gshadow: name=google-sudoers
Oct 06 20:52:10 xxx groupadd[658]: new group: name=google-sudoers, GID=1001
Oct 06 20:52:10 xxx GCEGuestAgent[572]: 2021-10-06T20:52:10.7259Z GCEGuestAgent Info: Created google sudoers file
Oct 06 20:52:10 xxx GCEGuestAgent[572]: 2021-10-06T20:52:10.7262Z GCEGuestAgent Info: Adding existing user root to google-sudoers group.
Oct 06 20:52:10 xxx gpasswd[680]: user root added by root to group google-sudoers
Oct 06 20:52:10 xxx GCEGuestAgent[572]: 2021-10-06T20:52:10.7489Z GCEGuestAgent Info: Updating keys for user root.
Oct 06 20:52:11 xxx google_guest_agent[572]: 2021/10/06 20:52:11 logging client: rpc error: code = PermissionDenied desc = Request had insufficient authentication scopes.
Oct 06 20:53:17 xxx GCEGuestAgent[572]: 2021-10-06T20:53:17.5595Z GCEGuestAgent Info: GCE Agent Stopped
Oct 06 20:53:17 xxx systemd[1]: Stopping Google Compute Engine Guest Agent...
-- Subject: A stop job for unit google-guest-agent.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit google-guest-agent.service has begun execution.
--
-- The job identifier is 1133.
Oct 06 20:53:17 xxx systemd[1]: google-guest-agent.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit google-guest-agent.service has successfully entered the 'dead' state.
Oct 06 20:53:17 xxx systemd[1]: Stopped Google Compute Engine Guest Agent.
-- Subject: A stop job for unit google-guest-agent.service has finished
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit google-guest-agent.service has finished.
--
-- The job identifier is 1133 and the job result is done.
Oct 06 20:53:17 xxx systemd[1]: Starting Google Compute Engine Guest Agent...
-- Subject: A start job for unit google-guest-agent.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit google-guest-agent.service has begun execution.
--
-- The job identifier is 1133.
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.5816Z GCEGuestAgent Info: GCE Agent Started (version 20210629.00-0ubuntu1~20.04.0)
Oct 06 20:53:17 xxx systemd[1]: Started Google Compute Engine Guest Agent.
-- Subject: A start job for unit google-guest-agent.service has finished successfully
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit google-guest-agent.service has finished successfully.
--
-- The job identifier is 1133.
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.7284Z GCEGuestAgent Info: Updating keys for user root.
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.7472Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart nscd.service: Unit nscd.service not found.
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.7716Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart unscd.service: Unit unscd.service not found.
Oct 06 20:53:17 xxx dhclient[2231]: Internet Systems Consortium DHCP Client 4.4.1
Oct 06 20:53:17 xxx dhclient[2231]: Copyright 2004-2018 Internet Systems Consortium.
Oct 06 20:53:17 xxx dhclient[2231]: All rights reserved.
Oct 06 20:53:17 xxx dhclient[2231]: For info, please visit https://www.isc.org/software/dhcp/
Oct 06 20:53:17 xxx dhclient[2231]:
Oct 06 20:53:17 xxx dhclient[2231]: Listening on Socket/ens4
Oct 06 20:53:17 xxx dhclient[2231]: Sending on   Socket/ens4
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.9344Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart cron.service: Unit cron.service not found.
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.9421Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart crond.service: Unit crond.service not found.
Oct 06 20:53:18 xxx google_guest_agent[2166]: 2021/10/06 20:53:18 logging client: rpc error: code = PermissionDenied desc = Request had insufficient authentication scopes.
Oct 06 20:58:06 xxx systemd[1]: Stopping Google Compute Engine Guest Agent...
-- Subject: A stop job for unit google-guest-agent.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit google-guest-agent.service has begun execution.
--
-- The job identifier is 2265.
Oct 06 20:58:06 xxx systemd[1]: google-guest-agent.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit google-guest-agent.service has successfully entered the 'dead' state.
Oct 06 20:58:06 xxx systemd[1]: Stopped Google Compute Engine Guest Agent.
-- Subject: A stop job for unit google-guest-agent.service has finished
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit google-guest-agent.service has finished.
--
-- The job identifier is 2265 and the job result is done.

systemd-analyze plot with the inactive (dead) state: systemd-analyze-plot-boot-problem.svg.gz

Logs after the virtual machine created by the MIG is manually restarted (shutdown -r now).

systemctl status google-guest-agent.service

root@xxx:~# systemctl status google-guest-agent.service
● google-guest-agent.service - Google Compute Engine Guest Agent
     Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-10-06 21:22:59 UTC; 6min ago
   Main PID: 440 (google_guest_ag)
      Tasks: 9 (limit: 4403)
     Memory: 20.3M
     CGroup: /system.slice/google-guest-agent.service
             └─440 /usr/bin/google_guest_agent

Oct 06 21:22:59 xxx dhclient[580]: Listening on Socket/ens4
Oct 06 21:22:59 xxx dhclient[580]: Sending on   Socket/ens4
Oct 06 21:22:59 xxx GCEGuestAgent[440]: 2021-10-06T21:22:59.6917Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart nscd.service: Unit nscd.service not found.
Oct 06 21:22:59 xxx GCEGuestAgent[440]: 2021-10-06T21:22:59.7090Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart unscd.service: Unit unscd.service not found.
Oct 06 21:22:59 xxx GCEGuestAgent[440]: 2021-10-06T21:22:59.9446Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart crond.service: Unit crond.service not found.
Oct 06 21:23:00 xxx oslogin_cache_refresh[690]: Refreshing passwd entry cache
Oct 06 21:23:00 xxx oslogin_cache_refresh[690]: Refreshing group entry cache
Oct 06 21:23:00 xxx oslogin_cache_refresh[690]: Failure getting groups, quitting
Oct 06 21:23:00 xxx oslogin_cache_refresh[690]: Failed to get groups, not updating group cache file, removing /etc/oslogin_group.cache.bak.
Oct 06 21:23:00 xxx google_guest_agent[440]: 2021/10/06 21:23:00 logging client: rpc error: code = PermissionDenied desc = Cloud Logging API has not been used in project 407489596486 before or it is disabled. Enable it by visiting >

systemd-analyze verify google-guest-agent.service

root@xxx:~# systemd-analyze verify google-guest-agent.service
snap-snapd-13170.mount: Unit is bound to inactive unit dev-loop1.device. Stopping, too.
snap-core18-2128.mount: Unit is bound to inactive unit dev-loop2.device. Stopping, too.

systemd-analyze critical-chain google-guest-agent.service

root@xxx:~# systemd-analyze critical-chain google-guest-agent.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

google-guest-agent.service +2.203s
└─rsyslog.service @6.174s +104ms
  └─basic.target @6.079s
    └─sockets.target @6.067s
      └─snapd.socket @5.979s +63ms
        └─sysinit.target @5.823s
          └─cloud-init.service @5.028s +758ms
            └─systemd-networkd-wait-online.service @3.220s +1.797s
              └─systemd-networkd.service @3.156s +53ms
                └─network-pre.target @3.143s
                  └─cloud-init-local.service @1.893s +1.239s
                    └─systemd-udev-trigger.service @777ms +182ms
                      └─systemd-udevd-kernel.socket @659ms
                        └─system.slice @392ms
                          └─-.slice @392ms

systemd-analyze plot with the active (running) state: systemd-analyze-plot-boot-ok.svg.gz

google-guest-agent.service dependency graph

dependency-graph

Reproduction steps

  • Create an image with Packer using ubuntu-os-cloud/ubuntu-minimal-2004-lts
  • Add the image to a MIG template
  • Launch the image in the MIG
  • Try to connect with OS Login
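The MIG side of these steps might be scripted roughly as follows. This is a sketch under stated assumptions, not the exact commands used for this report: the image, template, MIG, and zone names are hypothetical placeholders, the image is assumed to have been built by Packer already, and OS Login is assumed to be enabled through the `enable-oslogin` metadata key on the template. By default the script only prints the `gcloud` commands; set `DRY_RUN=0` to actually run them.

```shell
#!/bin/sh
# Sketch only: every name below is a hypothetical placeholder.
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands, 0 = execute them
IMAGE="${IMAGE:-example-packer-image}"   # image produced by the Packer build
TEMPLATE="${TEMPLATE:-example-template}"
MIG="${MIG:-example-mig}"
ZONE="${ZONE:-us-central1-a}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    # Instance template that boots the Packer-built image.
    # Assumption: OS Login is turned on via instance metadata.
    run gcloud compute instance-templates create "$TEMPLATE" \
        --image="$IMAGE" --metadata=enable-oslogin=TRUE

    # One-instance MIG created from that template.
    run gcloud compute instance-groups managed create "$MIG" \
        --template="$TEMPLATE" --size=1 --zone="$ZONE"

    # List the generated instance name so it can be used for the login attempt.
    run gcloud compute instance-groups managed list-instances "$MIG" \
        --zone="$ZONE"
}
```

Running `reproduce` and then attempting `gcloud compute ssh` against the listed instance would exercise the last step; on an affected image the login is expected to fail until the instance is rebooted.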

Workaround

  • Create an /etc/rc.local script that runs the command /usr/bin/systemctl restart google-guest-agent.

I know this isn't the most elegant way to fix the problem.
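For illustration, the workaround file could be generated during the image build along these lines. This is a minimal sketch: `RC_LOCAL` is a hypothetical variable that defaults to a scratch path so the snippet can be tried without root; on a real image it would point to /etc/rc.local. It also assumes the stock Ubuntu 20.04 behavior where systemd only runs /etc/rc.local at boot if the file is executable.

```shell
#!/bin/sh
# Sketch only: RC_LOCAL is a hypothetical variable. It defaults to a scratch
# file; set RC_LOCAL=/etc/rc.local (as root) to install the real workaround.
RC_LOCAL="${RC_LOCAL:-./rc.local.example}"

# Write the rc.local script. The quoted 'EOF' keeps the body literal.
cat > "$RC_LOCAL" <<'EOF'
#!/bin/sh
# Workaround: restart the guest agent late in boot, because on the first
# boot of a MIG instance the unit is left inactive (dead).
/usr/bin/systemctl restart google-guest-agent
exit 0
EOF

# The file must be executable, otherwise it is skipped at boot.
chmod +x "$RC_LOCAL"
```

Because rc.local runs on every boot, this restarts the agent on later boots as well, not just on the first one.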

I saw the same behavior with version 20210414.00-0ubuntu1~20.04.0 of google-guest-agent.

I believe it is not an agent-related issue, but I don't know enough about this project to continue debugging the problem by myself.

Please let me know if there is any additional information I can provide that will be helpful.

Thanks,

FWIW there's some further discussion/diagnosis of the underlying cause of this issue tracked in https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1938299. Seems to be a cloud-init bug.

Thanks for the detailed report, @lborguetti.
As mentioned by @jonahbull, this is an issue with cloud-init and is being tracked in their system. We will update this issue once a fix is released.

Thanks for the report @lborguetti, I'm having a similar issue but in my case, the load balancer routes are not created and therefore I cannot send traffic to the instances.

Thanks for the update @hopkiw

I think google-guest-agent is a critical service, and maybe it should have a fallback so that it does not depend on the OS boot sequence. In the future, other dependency failures may cause the same behavior.

Any updates on this?
According to https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1938299, the bug in cloud-init has been fixed, but there is another one in google-guest-agent that is still marked as confirmed.