GoogleCloudPlatform / guest-agent

Restart Agent when SystemD Network unit is restarted

ricbartm opened this issue

Environment

OS: Ubuntu 20.04 LTS
Kernel: 5.4.0-1037-gcp #40-Ubuntu SMP Fri Feb 5 11:57:53 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
SystemD version: 245.4-4ubuntu3.5
Google Guest Agent version: 20201217.02-0ubuntu1~20.04.0

Problem

When Ubuntu releases a security upgrade for the systemd package, the systemd-networkd service is restarted. This can leave a GCP instance unable to serve traffic.

When the systemd-networkd.service unit is restarted, the operating system's local routing table is wiped. This causes the local host routes for Google Cloud regional TCP Load Balancers to disappear and produces the following behavior:

  • The health checks, addressed to the TCP LB service IP, start failing because the node no longer has a host route for that IP
  • With all instances in a failed state, the TCP LB enters an always-open state and keeps sending traffic to them. The instances drop the traffic directed to the TCP LB service IP (they never answer the TCP SYN packet) because the host route is missing.

The workaround for this issue is to restart the google-guest-agent.service systemd unit, which adds the host routes back so that both health checks and traffic start working again.

Reproduction steps

  1. Create a TCP regional LB in a given region (does not matter if the public IP is static or ephemeral)
  2. Configure a GCP instance in the same region as a backend instance. Configure a basic TCP health check on a TCP port that is wide open
  3. Configure a frontend listener on port 80 using an ephemeral IP
  4. Wait for it to be created
  5. SSH to the instance and verify that the TCP LB ephemeral IP is listed as a host route in the output of ip ro list table local
  6. Restart systemd-networkd using systemctl restart systemd-networkd
  7. Check the local route table again and verify the route is no longer there.

At this point, the route won't be re-added on its own. You need to restart the google-guest-agent.service systemd unit for the routes to be re-added.
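
The check in step 5 and the restart workaround can be combined in a small script. This is a minimal sketch, not part of the guest agent. It assumes the load balancer route appears in the local table as a line starting with "local <LB_IP> ", and the function names lb_route_present and ensure_lb_route are made up for illustration.

```shell
#!/bin/sh
# Sketch: detect a missing load balancer host route and restart the guest agent.
# Assumption: the route shows up in `ip route list table local` as a line
# starting with "local <LB_IP> ". Adjust the pattern if your output differs.

# lb_route_present IP
# Reads routing-table text on stdin; returns 0 if a local route for IP is listed.
lb_route_present() {
    # Escape dots so they match literally in the regular expression.
    ip_re=$(printf '%s' "$1" | sed 's/\./\\./g')
    grep -Eq "^local ${ip_re} "
}

# ensure_lb_route IP
# Restarts google-guest-agent.service when the route for IP is missing.
ensure_lb_route() {
    if ip route list table local | lb_route_present "$1"; then
        echo "route for $1 present"
    else
        echo "route for $1 missing, restarting google-guest-agent.service"
        sudo systemctl restart google-guest-agent.service
    fi
}
```

For example, ensure_lb_route 203.0.113.10 would check a load balancer IP of 203.0.113.10 (a placeholder documentation address). This only papers over the problem after the fact; it does not prevent the routes from being wiped.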

Solution

The systemd-networkd.service unit is not listed as part of the PartOf directive in the Google Guest Agent service unit configuration. See https://github.com/GoogleCloudPlatform/guest-agent/blob/master/google-guest-agent.service#L7

PartOf does include networking.service, but that systemd unit belongs to the ifupdown package. In this specific use case the network is managed by systemd-networkd, so the google-guest-agent.service configuration needs to account for that unit as well.
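
Until the packaged unit file changes, the same effect can probably be had locally with a systemd drop-in, since dependency settings such as PartOf given in a drop-in are added to those in the main unit file. The following is a sketch under that assumption. The function name write_networkd_dropin and the file name 10-systemd-networkd.conf are arbitrary choices for illustration, not something the guest agent ships.

```shell
#!/bin/sh
# Sketch: write a drop-in that ties google-guest-agent.service to
# systemd-networkd.service, so restarting networkd also restarts the agent.

# write_networkd_dropin DIR
# Creates DIR if needed and writes the drop-in file into it.
write_networkd_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/10-systemd-networkd.conf" <<'EOF'
[Unit]
PartOf=systemd-networkd.service
EOF
}
```

Run as root, write_networkd_dropin /etc/systemd/system/google-guest-agent.service.d followed by systemctl daemon-reload should make systemd pick up the extra dependency. Repeating reproduction steps 6 and 7 afterwards is a quick way to confirm that the routes come back.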

We discovered this issue and the fix during an incident today on GitLab.com SaaS. Sharing the RCA for visibility: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5196#note_632054352