GoogleCloudPlatform / guest-agent

Error watching metadata: invalid character '<' looking for beginning of value

mschewe opened this issue

We regularly see the following error in our logs. We run on GCP in europe-west1, using our own image based on the Google Cloud Debian 12 image. It is built with Packer and has no configuration changes to networking or the guest agent.

Error watching metadata: invalid character '<' looking for beginning of value

I've also been seeing this issue:

  • VM-A: 2023-10-12 and 2023-11-05
  • VM-B: 2023-11-05 and 2023-11-07

Here's the log message:

{
  "insertId": "REDACTED",
  "jsonPayload": {
    "omitempty": null,
    "localTimestamp": "2023-11-07T14:52:01.3722Z",
    "message": "Error watching metadata: invalid character '<' looking for beginning of value"
  },
  "resource": {
    "type": "gce_instance",
    "labels": {
      "zone": "us-east5-a",
      "project_id": "REDACTED",
      "instance_id": "REDACTED"
    }
  },
  "timestamp": "2023-11-07T14:52:01.372680769Z",
  "severity": "ERROR",
  "labels": {
    "instance_name": "REDACTED"
  },
  "logName": "projects/REDACTED/logs/GCEGuestAgent",
  "sourceLocation": {
    "file": "metadata.go",
    "line": "74",
    "function": "github.com/GoogleCloudPlatform/guest-agent/google_guest_agent/events/metadata.(*Watcher).Run"
  },
  "receiveTimestamp": "2023-11-07T14:52:02.433931405Z"
}

Here's an excerpt from syslog from today's incident on VM-B:

Nov  7 14:48:50 debian dhclient[908]: RCV: Advertise message on ens4 from fe80::4001:aff:feca:1.
Nov  7 14:48:50 debian dhclient[908]: Packet received, but nothing done with it.
Nov  7 14:50:42 debian dhclient[472]: XMT: Solicit on ens4, interval 128610ms.
Nov  7 14:50:42 debian dhclient[908]: RCV: Advertise message on ens4 from fe80::4001:aff:feca:1.
Nov  7 14:50:42 debian dhclient[908]: Packet received, but nothing done with it.
Nov  7 14:50:51 debian systemd[1]: Starting GCE Workload Certificate refresh...
Nov  7 14:50:51 debian gce_workload_cert_refresh[35598]: 2023/11/07 14:50:51: Error getting config status, workload certificates may not be configured: failed to GET "instance/gce-workload-certificates/config-status" from MDS with error: error connecting to metadata server, status code: 404
Nov  7 14:50:51 debian gce_workload_cert_refresh[35598]: 2023/11/07 14:50:51: Done
Nov  7 14:50:51 debian systemd[1]: gce-workload-cert-refresh.service: Succeeded.
Nov  7 14:50:51 debian systemd[1]: Finished GCE Workload Certificate refresh.
Nov  7 14:52:01 debian google_guest_agent[526]: ERROR metadata.go:74 Error watching metadata: invalid character '<' looking for beginning of value
Nov  7 14:52:01 debian google_guest_agent[526]: Metadata event watcher failed, ignoring: invalid character '<' looking for beginning of value
Nov  7 14:52:01 debian google_guest_agent[526]: Metadata event watcher failed, ignoring: invalid character 'M' looking for beginning of value
Nov  7 14:52:01 debian google_guest_agent[526]: Metadata event watcher failed, ignoring: invalid character 'M' looking for beginning of value
Nov  7 14:52:01 debian google_guest_agent[526]: Metadata event watcher failed, ignoring: invalid character 'M' looking for beginning of value
Nov  7 14:52:01 debian google_guest_agent[526]: Metadata event watcher failed, ignoring: invalid character 'M' looking for beginning of value
[repeated another 579 times]
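
Both variants of the message ('<' and 'M') have the wording Go's encoding/json uses when the first byte of the input cannot start a JSON value. A minimal sketch, assuming the agent was handed a non-JSON body such as an HTML or plain-text error page; the sample bodies and the decodeErr helper are made up for illustration and were not captured from the metadata server:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeErr tries to parse body as JSON and returns the error text,
// or an empty string if the body is valid JSON.
func decodeErr(body string) string {
	var v interface{}
	if err := json.Unmarshal([]byte(body), &v); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Hypothetical response bodies, for illustration only.
	samples := []string{
		"<html><body>503 Service Unavailable</body></html>",
		"Metadata server unavailable",
		`{"instance": {"name": "vm-b"}}`,
	}
	for _, s := range samples {
		fmt.Printf("%q -> %q\n", s, decodeErr(s))
	}
}
```

If that assumption holds, the error says more about what the endpoint returned at that moment than about the agent's own configuration.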

The OS image is stock debian-cloud/debian-11, and the guest agent is on the latest version:

root@VM-B:# uname -a
Linux VM-B 5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux
root@VM-B# dpkg -l | grep -i guest
ii  google-compute-engine                 1:20230801.00-g1               all          Google Compute Engine guest environment.
ii  google-guest-agent                    1:20231004.02-g1               amd64        Google Compute Engine Guest Agent
root@VM-B:/home/tadhunt# 

@tadhunt @mschewe

I have the same issue on my stock Debian 11 instance running in Google Cloud.

However, I have a different google-guest-agent version than you do.

Before I ran commands to upgrade packages:

uname -a
Linux [redacted] 3 5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux

dpkg -l | grep -i guest
ii  google-compute-engine                 1:20230801.00-g1               all          Google Compute Engine guest environment.
ii  google-guest-agent                    1:20231115.00-g1               amd64        Google Compute Engine Guest Agent

Commands used:

apt-get install --only-upgrade google-cloud-cli
apt-get install --only-upgrade google-cloud-packages-archive-keyring
apt-get install --only-upgrade google-osconfig-agent
apt-get install --only-upgrade tzdata

After I ran the commands to upgrade packages:

uname -a
Linux [Redacted] 5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux


dpkg -l | grep -i guest
ii  google-compute-engine                 1:20230801.00-g1               all          Google Compute Engine guest environment.
ii  google-guest-agent                    1:20231115.00-g1               amd64        Google Compute Engine Guest Agent

More information: result of apt list --upgradable

[image: output of apt list --upgradable]

Let me know if you need more information to troubleshoot.

I just did the upgrade today, so I am going to leave my instance running for 48 hours and watch the serial console output to see whether I can reproduce the error.

I hope this helps.

@mschewe @tadhunt After 72 hours, I have not seen the error. I am not sure what resolved it.

Again, I am more than happy to debug further if you want me to run certain commands in my Debian 11 instance or in Google Cloud Platform.

Below is a snippet of the serial console output from Google Cloud Platform. It looks normal.

google_guest_agent[481]: Scheduler - wake: [now 2023-12-18 12:10:48.129313902 -0600 CST]
google_guest_agent[481]: Scheduler - run: [now 2023-12-18 12:10:48.129313902 -0600 CST entry 1 next 2023-12-19 12:10:48 -0600 CST]
google_guest_agent[481]: Invoking job "telemetryJobID"

@tadhunt @rpp293 Thank you very much for helping to resolve this issue.
I will try updating google-guest-agent to the latest version and let you know whether this solves the problem. The problem occurs in bursts; there have already been gaps of a few days between occurrences.

Here are the last 14 days:
[image: histogram of the error over the last 14 days]

The underlying problem is that the metadata server occasionally cannot be reached.
According to the logs, the guest agent contacts the metadata server only once per day, which might make the problem more significant than I thought: if each VM contacts the metadata server only every 24 hours, many of those calls fail. I have not checked the code to see whether it retries.

@tadhunt @mschewe

No problem.

Quick question: what steps are you taking to generate the histogram? Would you be able to write them out?

I am new to Google Cloud Platform and do not know all of the features yet.

Maybe I can analyze the histogram on my end for my instance to see if we can come up with the same result or similar result.

Greatly appreciated.

@rpp293 Sure, no problem. Go to the Google Cloud Console and search for "Logs Explorer".
Select the timeframe and enter the query:

severity=ERROR
jsonPayload.message="Error watching metadata: invalid character '<' looking for beginning of value"

and hit the "Run Query" button.
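
If you would rather count occurrences offline, the same filter can be applied to exported log entries. A minimal sketch, assuming the entries were downloaded as a JSON array with the same fields as the log message posted earlier in this thread; the logEntry type and countByDay function are hypothetical names, and the second sample entry is made up:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// logEntry holds only the fields of a log entry that the filter needs.
type logEntry struct {
	Timestamp   time.Time `json:"timestamp"`
	Severity    string    `json:"severity"`
	JSONPayload struct {
		Message string `json:"message"`
	} `json:"jsonPayload"`
}

const target = "Error watching metadata: invalid character '<' looking for beginning of value"

// countByDay parses a JSON array of log entries and counts matching
// ERROR entries per UTC day (formatted as YYYY-MM-DD).
func countByDay(raw []byte) (map[string]int, error) {
	var entries []logEntry
	if err := json.Unmarshal(raw, &entries); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, e := range entries {
		if e.Severity == "ERROR" && e.JSONPayload.Message == target {
			counts[e.Timestamp.UTC().Format("2006-01-02")]++
		}
	}
	return counts, nil
}

func main() {
	// First entry mirrors the log message above; the second is made up.
	raw := []byte(`[
	  {"timestamp": "2023-11-07T14:52:01.372680769Z", "severity": "ERROR",
	   "jsonPayload": {"message": "Error watching metadata: invalid character '<' looking for beginning of value"}},
	  {"timestamp": "2023-11-07T15:00:00Z", "severity": "INFO",
	   "jsonPayload": {"message": "unrelated"}}
	]`)
	counts, err := countByDay(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	days := make([]string, 0, len(counts))
	for d := range counts {
		days = append(days, d)
	}
	sort.Strings(days)
	for _, d := range days {
		fmt.Println(d, counts[d])
	}
}
```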

Edit: I created histograms for all our environments:
[images: prod, staging, testing]

This looks like a problem in the infrastructure. Please let me know what your thoughts are. Do you have similar error patterns?

@mschewe

Sorry for the late response.

Thank you for the tutorial.

I left my VM on for 8 days and I do not have any similar error patterns.

I searched the Logs Explorer for both the jsonPayload.message query above and the text "Error watching metadata", but got no results.

Closing this as it does not look like an active issue anymore.

We are getting the same error on all our Windows servers. Is there a specific version to install to fix the issue?