google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

Cannot delete a deleted container/process: not found

colinking opened this issue

Description

We have run into an issue where containerd gets stuck in a loop while attempting to delete a gVisor container.

This issue occurs when a container exits. We see logs showing containerd failed to handle the TaskExit event:

containerd[1861]: time="2022-05-02T15:02:37.104147570Z" level=error msg="Failed to handle exit event &TaskExit{ContainerID:353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507,ID:353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507,Pid:5475,ExitStatus:0,ExitedAt:2022-05-02 15:02:27.100771368 +0000 UTC,XXX_unrecognized:[],} for 353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507" error="failed to handle container TaskExit event: failed to stop container: context deadline exceeded: unknown"

This initial error stems from an (unrelated) transient networking issue. However, gVisor still received the request and deleted the container. When containerd backs off and retries, it ends up stuck in an error loop:

containerd[1861]: time="2022-05-02T15:02:42.659175219Z" level=error msg="Failed to handle backOff event &TaskExit{ContainerID:353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507,ID:353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507,Pid:5475,ExitStatus:0,ExitedAt:2022-05-02 15:02:27.100771368 +0000 UTC,XXX_unrecognized:[],} for 353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507" error="failed to handle container TaskExit event: failed to stop container: cannot delete a deleted container/process: not found: unknown"

This second error originates in gVisor's deleted_state.go, which returns a wrapped errdefs.ErrNotFound.

This error is passed back to containerd and handled there. However, containerd does not recognize it as an ErrNotFound (it surfaces as "failed to stop container: ... unknown"), so containerd keeps retrying.

Steps to reproduce

We are encountering this issue with pods running in GKE Sandbox (1.21.10-gke.2000) on cos_containerd nodes.

runsc version

$ /home/containerd/usr/local/sbin/runsc --version
runsc version google-431402566
spec: 1.0.2-dev

docker version (if using docker)

$ crictl version
Version:  0.1.0
RuntimeName:  containerd
RuntimeVersion:  1.4.8
RuntimeApiVersion:  v1alpha2

$ ctr version
Client:
  Version:  1.4.8
  Revision: 7eba5930496d9bbe375fdf71603e610ad737d2b2
  Go version: go1.13.5

Server:
  Version:  1.4.8
  Revision: 7eba5930496d9bbe375fdf71603e610ad737d2b2
  UUID: a197db6e-c4fe-46c8-8e1c-0e723b8cbd70

uname

Linux gke-cluster-airplane-pool-public-2110-605384a7-482k 5.4.170+ #1 SMP Sat Mar 5 10:08:44 PST 2022 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux

kubectl (if using Kubernetes)

$ kc version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:41:58Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.10-gke.2000", GitCommit:"0823380786b063c3f71d5e7c76826a972e30550d", GitTreeState:"clean", BuildDate:"2022-03-17T09:22:22Z", GoVersion:"go1.16.14b7", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.23) and server (1.21) exceeds the supported minor version skew of +/-1

repo state (if built from source)

No response

runsc debug logs (if available)

No response

I'm not sure how nested Go errors are passed across gRPC, but my guess is that the wrapping gets lost in translation. We could try to massage the error to make containerd happy, or we could not error out in this case. Do you have a way to repro this problem?

@fvoznika thanks for taking a look! My sense is that the easiest approach here (and the one that changes the API behavior the least) is to just replace the error in deleted_state.go with a utils.ErrToGRPCf(errdefs.ErrNotFound, ...). It looks like a bunch of other errors were converted to this form in an earlier change.

Unfortunately, we don't have an easy way to reproduce this problem at the moment. As part of running our product, we start and terminate several thousand containers per hour, and this error happens only once or twice an hour. However, when it happens, it causes a bunch of downstream problems that we need to manually clean up, so we're eager to get it fixed.

Thanks, and let us know if there's any other information we can provide.

@fvoznika thanks for making that fix. I was wondering: why did you get rid of the code in pkg/shim/utils/errors.go? Because gvisor is still pinned to containerd v1.3.9, the errdefs.ToGRPC function being called uses errors.Cause instead of errors.Is to classify the error, so I think the utility code actually helped avoid some translation problems.

Here's some code I used to test:

package main

import (
	"errors"
	"fmt"

	// Using containerd pinned at v1.3.9 (same version as used by gvisor)
	"github.com/containerd/containerd/errdefs"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	// Error returned from 'func (*deletedState) Delete(context.Context)'
	origErr := fmt.Errorf("cannot delete a deleted container/process: %w", errdefs.ErrNotFound)

	// Using current logic
	translatedErr := errdefs.ToGRPC(origErr)
	containerdErr := errdefs.FromGRPC(translatedErr)
	fmt.Println(
		"Error as seen by containerd:",
		containerdErr,
		errors.Is(containerdErr, errdefs.ErrNotFound),
	)

	// What I think we want instead (i.e., what was done in the old utility function)?
	translatedErr = status.Error(codes.NotFound, origErr.Error())
	containerdErr = errdefs.FromGRPC(translatedErr)
	fmt.Println(
		"Error as seen by containerd:",
		containerdErr,
		errors.Is(containerdErr, errdefs.ErrNotFound),
	)
}

The outputs are:

Error as seen by containerd: cannot delete a deleted container/process: not found: unknown false
Error as seen by containerd: cannot delete a deleted container/process: not found true

which has me worried that the containerd server (running at v1.5.4) will still treat the deletion error as being "unknown".

Let me know if that makes sense, and apologies if I'm misunderstanding!

We changed gVisor to use containerd 1.4 internally a while back. I started the change to update it externally too, but got distracted with other stuff. I'll push a change to update it. Thanks for checking!

If you patch in #7574, you should see the following with your test:

Error as seen by containerd: cannot delete a deleted container/process: not found true
Error as seen by containerd: cannot delete a deleted container/process: not found true

@fvoznika ahh, makes sense now. Thanks!