actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners

MessageQueueToken error - failed to start message queue listener

kwngo opened this issue · comments

Controller Version

0.7.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

We're seeing issues after runner pods have been running for a long time: the listener fails and then no new runner pods are brought up. The error we're seeing in the listener is below, and it seems to point to a problem refreshing the message queue token. The only way we've found to get things working again is to delete the full Helm release and re-install it (a rough sketch of the commands we run is after the error).

2024-06-11T18:39:56Z	ERROR	Error encountered	{"error": "failed to start message queue listener: could not get and process message. get message failed from refreshing client. get message failed. Get \"https://pipelinesghubeus11.actions.githubusercontent.com/FQqfquGw3l7Fd5bL7OmYmUPBAsddiKs6nVqQWRmLn7iZpNMauk/_apis/runtime/runnerscalesets/40/messages?api-version=6.0-preview&lastMessageId=17&sessionId=41843c95-57a1-4b08-bdc7-25db85b6c9bb\": GET https://pipelinesghubeus11.actions.githubusercontent.com/FQqfquGw3l7Fd5bL7OmYmUPBAsddiKs6nVqQWRmLn7iZpNMauk/_apis/runtime/runnerscalesets/40/messages?api-version=6.0-preview&lastMessageId=17&sessionId=41843c95-57a1-4b08-bdc7-25db85b6c9bb giving up after 1 attempt(s): context canceled"}
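For reference, the re-install workaround we've been using looks roughly like this; the release name, namespace, chart version, and values file are from our setup and will differ for other installs, and we give the corresponding gha-runner-scale-set release the same treatment:

# remove the stuck release, then install it again with the same values
helm uninstall arc --namespace arc-systems
helm install arc \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --namespace arc-systems \
  --version 0.7.0 \
  -f values.yaml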

Describe the bug

The runner scale set listener fails to refresh its message queue client and gets stuck.

Describe the expected behavior

The runner listener should be able to refresh its client and keep processing messages.

Additional Context

# Default values for gha-runner-scale-set-controller.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
labels: {}

# leaderElection will be enabled when replicaCount > 1,
# so only one replica will be in charge of reconciliation at a given time.
# leaderElectionId will be set to {{ define gha-runner-scale-set-controller.fullname }}.
replicaCount: 1

image:
  repository: "ghcr.io/actions/gha-runner-scale-set-controller"
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: ""

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

env:
## Define environment variables for the controller pod
#  - name: "ENV_VAR_NAME_1"
#    value: "ENV_VAR_VALUE_1"
#  - name: "ENV_VAR_NAME_2"
#    valueFrom:
#      secretKeyRef:
#        key: ENV_VAR_NAME_2
#        name: secret-name
#        optional: true

serviceAccount:
  # Specifies whether a service account should be created for running the controller pod
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  # You can not use the default service account for this.
  name: ""

podAnnotations: {}

podLabels: {}

podSecurityContext: {}
# fsGroup: 2000

securityContext: {}
# capabilities:
#   drop:
#   - ALL
# readOnlyRootFilesystem: true
# runAsNonRoot: true
# runAsUser: 1000

resources: {}
## We usually recommend not to specify default resources and to leave this as a conscious
## choice for the user. This also increases the chances that charts run on environments with
## limited resources, such as Minikube. If you do want to specify resources, uncomment the
## following lines, adjust them as necessary, and remove the curly braces after 'resources:'.
# limits:
#   cpu: 100m
#   memory: 128Mi
# requests:
#   cpu: 100m
#   memory: 128Mi

nodeSelector: {}

tolerations: []

affinity: {}

# Leverage a PriorityClass to ensure your pods survive resource shortages
# ref: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/
# PriorityClass: system-cluster-critical
priorityClassName: ""

## If the `metrics:` object is not provided, or is commented out, the following flags
## will be applied to the controller-manager and listener pods with empty values:
## `--metrics-addr`, `--listener-metrics-addr`, `--listener-metrics-endpoint`.
## This will disable metrics.
##
## To enable metrics, uncomment the following lines.
# metrics:
#   controllerManagerAddr: ":8080"
#   listenerAddr: ":8080"
#   listenerEndpoint: "/metrics"

flags:
  ## Log level can be set here with one of the following values: "debug", "info", "warn", "error".
  ## Defaults to "debug".
  logLevel: "debug"
  ## Log format can be set with one of the following values: "text", "json"
  ## Defaults to "text"
  logFormat: "text"

  ## Restricts the controller to only watch resources in the desired namespace.
  ## Defaults to watch all namespaces when unset.
  # watchSingleNamespace: ""

  ## Defines how the controller should handle upgrades while having running jobs.
  ##
  ## The strategies available are:
  ## - "immediate": (default) The controller will immediately apply the change causing the
  ##   recreation of the listener and ephemeral runner set. This can lead to an
  ##   overprovisioning of runners, if there are pending / running jobs. This should not
  ##   be a problem at a small scale, but it could lead to a significant increase in
  ##   resource usage if you have a lot of jobs running concurrently.
  ##
  ## - "eventual": The controller will remove the listener and ephemeral runner set
  ##   immediately, but will not recreate them (to apply changes) until all
  ##   pending / running jobs have completed.
  ##   This can lead to a longer time to apply the change but it will ensure
  ##   that you don't have any overprovisioning of runners.
  updateStrategy: "immediate"

Controller Logs

https://gist.github.com/kwngo/fde29716162d836420861569ef4cd8e4

Runner Pod Logs

https://gist.github.com/kwngo/066db52346c26556a4948681b4163ee0
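In case it helps with triage, this is roughly how we inspect the listener when it gets into this state; the namespace and the pod name placeholder are from our install:

# the listener pod is created by the controller in the controller's namespace
kubectl get pods --namespace arc-systems
# tail the stuck listener's logs
kubectl logs <scale-set-listener-pod> --namespace arc-systems --tail=100
# deleting the pod makes the controller recreate the listener, but for us that
# has not been enough on its own; only re-installing the release recovers things
kubectl delete pod <scale-set-listener-pod> --namespace arc-systems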

Looks like this may have been a consequence of the GitHub-wide incident affecting the API and GitHub Actions.

I'm not sure if the issues I saw today are related to this, but after the outage all of my runners got stuck in a bad state and have been completely unrecoverable since. I had to create an entirely new runner scale set to get CI back up and running.

Hey everyone, I'm fairly certain this one was due to an incident, so I will close it now. Please let us know if the issue persists ☺️