google-github-actions / setup-gcloud

A GitHub Action for installing and configuring the gcloud CLI.

Home Page:https://cloud.google.com/sdk/docs

Setup gcloud shuts down a self-hosted actions/actions-runner

arkadyb opened this issue · comments

TL;DR

google-github-actions/setup-gcloud@v2 fails on self-hosted runners

Expected behavior

The gcloud CLI is installed.

Observed behavior

The pipeline's step freezes for about 35 seconds and then shuts down with the error:
##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

Nothing else in the logs.

Action YAML

...
      - name: Checkout
        uses: actions/checkout@v4

      - id: auth
        uses: "google-github-actions/auth@v2"
        with:
          credentials_json: "${{ secrets.GCP_CREDENTIALS }}"

      # Setup gcloud CLI
      - uses: google-github-actions/setup-gcloud@v2
...

Log output

2024-03-12T01:33:13.9350969Z ##[group]Run google-github-actions/setup-gcloud@v2
2024-03-12T01:33:13.9355631Z with:
2024-03-12T01:33:13.9357709Z   skip_install: false
2024-03-12T01:33:13.9360391Z   version: latest
2024-03-12T01:33:13.9362953Z env:
<here goes a list of env vars>
2024-03-12T01:33:13.9428275Z ##[endgroup]
2024-03-12T01:33:25.2932104Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /home/runner/_work/_temp/a7674f07-d419-4638-b721-9deb867abd28 -f /home/runner/_work/_temp/74c55beb-8172-419f-a122-23adc1b632fb
2024-03-12T01:33:52.9093344Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2024-03-12T01:33:53.0145954Z Cleaning up orphan processes

Additional information

I'm seeing odd behavior with very little in the logs: the self-hosted actions/actions-runner container simply exits (it receives a shutdown signal) whenever I try to install the gcloud CLI.

For security reasons we started using self-hosted runners. We went through the quickstart and used the two Helm charts to install the controller and the runner scale set, as described here.

With the steps outlined above in the Action YAML section, the pipeline just crashes.

Hi there - could you please provide the debug output for the complete GitHub Actions workflow run and the GitHub Actions runner (there are two environment variables to set)?

Hey @sethvargo ,

Note: I am using an Autopilot Kubernetes cluster on GKE.

Here is the log from the workflow's step:

 google-cloud-sdk/platform/gsutil/third_party/urllib3/test/with_dummyserver/test_poolmanager.py
google-cloud-sdk/platform/gsutil/third_party/urllib3/test/with_dummyserver/test_proxy_poolmanager.py
google-cloud-sdk/platform/gsutil/third_party/urllib3/test/with_dummyserver/test_socketlevel.py
google-cloud-sdk/properties
google-cloud-sdk/rpm/mapping/command_mapping.yaml
google-cloud-sdk/rpm/mapping/component_mapping.yaml
##[debug]Caching tool gcloud 467.0.0 x64
##[debug]source dir: /home/runner/_work/_temp/d1568323-a31d-436b-ac46-c573872b353f/google-cloud-sdk
##[debug]destination /home/runner/_work/_tool/gcloud/467.0.0/x64
##[debug]CLOUDSDK_METRICS_ENVIRONMENT='github-actions-setup-gcloud'
##[debug]CLOUDSDK_METRICS_ENVIRONMENT_VERSION='2.1.0'
Error: The operation was canceled.
##[debug]System.OperationCanceledException: The operation was canceled.
##[debug]   at System.Threading.CancellationToken.ThrowOperationCanceledException()
##[debug]   at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
##[debug]   at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
##[debug]   at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(IExecutionContext context, String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, String standardInInput, CancellationToken cancellationToken)
##[debug]   at GitHub.Runner.Worker.Handlers.NodeScriptActionHandler.RunAsync(ActionRunStage stage)
##[debug]   at GitHub.Runner.Worker.ActionRunner.RunAsync()
##[debug]   at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
##[debug]Finishing: Run google-github-actions/setup-gcloud@v2

From the log, it looks like the step is just downloading the tool when it is hit by an OperationCanceledException, as if something external cancelled the process.

Here are the runner container logs:

[WORKER 2024-03-12 23:00:55Z INFO JobServerQueue] Try to append 2 batches web console lines for record 'e5bbf9a0-fd17-5014-01a5-381f9c676dbc', success rate: 2/2.
[WORKER 2024-03-12 23:00:56Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2024-03-12 23:00:56Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2024-03-12 23:00:56Z INFO HostContext] Well known directory 'Work': '/home/runner/_work'
[WORKER 2024-03-12 23:00:56Z INFO JobServerQueue] Try to append 2 batches web console lines for record 'e5bbf9a0-fd17-5014-01a5-381f9c676dbc', success rate: 2/2.
[WORKER 2024-03-12 23:00:56Z INFO JobServerQueue] Try to append 2 batches web console lines for record 'e5bbf9a0-fd17-5014-01a5-381f9c676dbc', success rate: 2/2.
[WORKER 2024-03-12 23:00:57Z INFO JobServerQueue] Try to append 3 batches web console lines for record 'e5bbf9a0-fd17-5014-01a5-381f9c676dbc', success rate: 3/3.
[WORKER 2024-03-12 23:00:57Z INFO JobServerQueue] Try to append 3 batches web console lines for record 'e5bbf9a0-fd17-5014-01a5-381f9c676dbc', success rate: 3/3.
Exiting runner...
[WORKER 2024-03-12 23:00:58Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[WORKER 2024-03-12 23:00:58Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[RUNNER 2024-03-12 23:00:58Z INFO Terminal] WRITE LINE: Exiting...
Exiting...
[RUNNER 2024-03-12 23:00:58Z INFO ConfigurationStore] IsServiceConfigured()
[RUNNER 2024-03-12 23:00:58Z INFO ConfigurationStore] IsServiceConfigured: False
[RUNNER 2024-03-12 23:00:58Z INFO HostContext] Runner will be shutdown for UserCancelled
[RUNNER 2024-03-12 23:00:58Z WARN GitHubActionsService] GET request to https://pipelinesghubeus3.actions.githubusercontent.com/UC7VdRuNjkxy4iPZQQYUQ2w2qvQudOlGg42SHkwyCxhzsTL5Ap/_apis/distributedtask/pools/1/messages?sessionId=8c3ba416-a837-49c6-9b5f-4f7fb8c203d1&lastMessageId=1&status=Busy&runnerVersion=2.314.1&os=Linux&architecture=X64&disableUpdate=true has been cancelled.
[RUNNER 2024-03-12 23:00:58Z INFO MessageListener] Get next message has been cancelled.
[RUNNER 2024-03-12 23:00:58Z INFO JobDispatcher] Shutting down JobDispatcher. Make sure all WorkerDispatcher has finished.
[RUNNER 2024-03-12 23:00:58Z INFO JobDispatcher] Ensure WorkerDispatcher for job 21d2ae92-bad9-51da-3a47-217a0af63fd8 run to finish, cancel any running job.
[RUNNER 2024-03-12 23:00:58Z INFO JobDispatcher] Send job cancellation message to worker for job 21d2ae92-bad9-51da-3a47-217a0af63fd8.
[RUNNER 2024-03-12 23:00:58Z INFO ProcessChannel] Sending message of length 0, with hash 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
[WORKER 2024-03-12 23:00:58Z INFO ProcessChannel] Receiving message of length 0, with hash 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
[WORKER 2024-03-12 23:00:58Z INFO Worker] Cancellation/Shutdown message received.
[WORKER 2024-03-12 23:00:58Z INFO HostContext] Runner will be shutdown for UserCancelled
[WORKER 2024-03-12 23:00:58Z INFO StepsRunner] Cancel current running step.
[WORKER 2024-03-12 23:00:58Z INFO ProcessInvokerWrapper] Sending SIGINT to process 96.
[WORKER 2024-03-12 23:00:58Z INFO ProcessInvokerWrapper] Successfully send SIGINT to process 96.
[WORKER 2024-03-12 23:00:58Z INFO ProcessInvokerWrapper] Waiting for process exit or 7.5 seconds after SIGINT signal fired.
[WORKER 2024-03-12 23:00:58Z INFO JobServerQueue] Try to append 4 batches web console lines for record 'e5bbf9a0-fd17-5014-01a5-381f9c676dbc', success rate: 4/4.
[WORKER 2024-03-12 23:00:58Z INFO JobServerQueue] Try to append 1 batches web console lines for record '21d2ae92-bad9-51da-3a47-217a0af63fd8', success rate: 1/1.
[WORKER 2024-03-12 23:00:58Z INFO ProcessInvokerWrapper] Process exit successfully.
[WORKER 2024-03-12 23:00:58Z INFO ProcessInvokerWrapper] Process cancelled successfully through Ctrl+C/SIGINT.
[WORKER 2024-03-12 23:00:58Z INFO JobServerQueue] Try to append 6 batches web console lines for record 'e5bbf9a0-fd17-5014-01a5-381f9c676dbc', success rate: 6/6.
[WORKER 2024-03-12 23:00:59Z INFO ProcessInvokerWrapper] Process Cancellation finished.
[WORKER 2024-03-12 23:00:59Z INFO ProcessInvokerWrapper] Finished process 96 with exit code 130, and elapsed time 00:01:19.2189707.
[WORKER 2024-03-12 23:00:59Z INFO CreateStepSummaryCommand] Step Summary file (/home/runner/_work/_temp/_runner_file_commands/step_summary_8b461e13-5d99-4132-a90d-f5fa511036e2) is empty; skipping attachment upload
[WORKER 2024-03-12 23:00:59Z ERR  StepsRunner] Caught cancellation exception from step: System.OperationCanceledException: The operation was canceled.
[WORKER 2024-03-12 23:00:59Z ERR  StepsRunner]    at System.Threading.CancellationToken.ThrowOperationCanceledException()
[WORKER 2024-03-12 23:00:59Z ERR  StepsRunner]    at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
[WORKER 2024-03-12 23:00:59Z ERR  StepsRunner]    at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
[WORKER 2024-03-12 23:00:59Z ERR  StepsRunner]    at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(IExecutionContext context, String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, String standardInInput, CancellationToken cancellationToken)
[WORKER 2024-03-12 23:00:59Z ERR  StepsRunner]    at GitHub.Runner.Worker.Handlers.NodeScriptActionHandler.RunAsync(ActionRunStage stage)
[WORKER 2024-03-12 23:00:59Z ERR  StepsRunner]    at GitHub.Runner.Worker.ActionRunner.RunAsync()
[WORKER 2024-03-12 23:00:59Z ERR  StepsRunner]    at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
[WORKER 2024-03-12 23:00:59Z INFO StepsRunner] Step result: Canceled
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext] Publish step telemetry for current step {
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "action": "google-github-actions/setup-gcloud",
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "ref": "v2",
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "type": "node20",
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "stage": "Main",
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "stepId": "e5bbf9a0-fd17-5014-01a5-381f9c676dbc",
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "stepContextName": "__google-github-actions_setup-gcloud",
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "hasPreStep": false,
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "hasPostStep": false,
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "result": "canceled",
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "errorMessages": [
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]     "The operation was canceled."
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   ],
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "executionTimeInSeconds": 80,
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "startTime": "2024-03-12T22:59:39.9304761Z",
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext]   "finishTime": "2024-03-12T23:00:59.3556491Z"
[WORKER 2024-03-12 23:00:59Z INFO ExecutionContext] }.
[WORKER 2024-03-12 23:00:59Z INFO StepsRunner] No need for updating job result with current step result 'Canceled'.

Hi @arkadyb - our ability to debug custom runners is limited, since each company builds and maintains images in a unique way. Are you using the latest GitHub Actions runner?

Exit code 130 is SIGINT, so something is telling the runner process to stop. Are you running on spot VMs by chance? I think GCP sends 130 as the termination signal. Alternatively, is GKE killing the pod because of memory or CPU limitations?

[RUNNER 2024-03-12 23:00:58Z INFO HostContext] Runner will be shutdown for UserCancelled

This certainly makes it seem like an external process is cancelling the job. Can you share the complete action.yaml?

Hi @arkadyb - our ability to debug custom runners is limited, since each company builds and maintains images in a unique way. Are you using the latest GitHub Actions runner?
We use the GitHub-provided gha-runner-scale-set-controller and gha-runner-scale-set as is.

We should not be running on Spot VMs, no.

Alternatively, is GKE killing the pod because of memory or CPU limitations?

That is an interesting guess. Let me try to debug along those lines. Do you happen to know how I can confirm that this is the problem?

This is the entire workflow I am testing with at the moment:

name: Build and Deploy to GKE

on:
  push:
    branches:
      - main

env:
  PROJECT_ID: ${{ secrets.GKE_PROJECT }}
  GAR_LOCATION: us-central1
  GKE_CLUSTER: autopilot-cluster-1 
  GKE_ZONE: us-central1
  DEPLOYMENT_NAME: nginx-deployment
  REPOSITORY: samples
  IMAGE: static-site

jobs:
  setup-build-publish-deploy:
    name: Setup, Build, Publish, and Deploy
    runs-on: arc-runner-set
    environment: production

    permissions:
      contents: "read"
      id-token: "write"

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      # Alternative option - authentication via credentials json
      - id: auth
        uses: "google-github-actions/auth@v2"
        with:
          credentials_json: "${{ secrets.GCP_CREDENTIALS }}"

      # Setup gcloud CLI
      - uses: google-github-actions/setup-gcloud@v2

Yay, found it:

9s          Warning   Evicted                pod/arc-runner-set-nkhpd-runner-m6fwr   Pod ephemeral local storage usage exceeds the total limit of containers 1Gi.

Thanks for pointing me in the right direction, @sethvargo!

Sounds like you need to bump resource limits 😄
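
For anyone landing here with the same eviction: one way to bump the limit, assuming the runners were installed with the gha-runner-scale-set Helm chart mentioned above, is to request more ephemeral storage for the runner container in that chart's values. This is a minimal sketch rather than a verified fix. The 4Gi figure is an arbitrary illustration and does not come from this thread, and the container name, image, and command are assumed to match the chart defaults, so check them against your own values file.

```yaml
# values.yaml for the gha-runner-scale-set release (sketch, not a tested config)
template:
  spec:
    containers:
      - name: runner                                  # assumed chart default
        image: ghcr.io/actions/actions-runner:latest  # assumed chart default
        command: ["/home/runner/run.sh"]              # assumed chart default
        resources:
          requests:
            # The Evicted event reported a total limit of 1Gi;
            # 4Gi is an illustrative value, size it for your own jobs.
            ephemeral-storage: "4Gi"
          limits:
            ephemeral-storage: "4Gi"
```

The Evicted event above reported a 1Gi total limit, so the value needs to cover the unpacked gcloud SDK plus whatever else the job writes under /home/runner/_work. After upgrading the release with the new values, rerunning the workflow and checking the pod events again should show whether the eviction is gone.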