Cannot delete a deleted container/process: not found
colinking opened this issue · comments
Description
We have run into an issue where containerd gets stuck in a loop while attempting to delete a gVisor container.
This issue occurs when a container exits. We see logs showing containerd failed to handle the TaskExit event:
containerd[1861]: time="2022-05-02T15:02:37.104147570Z" level=error msg="Failed to handle exit event &TaskExit{ContainerID:353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507,ID:353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507,Pid:5475,ExitStatus:0,ExitedAt:2022-05-02 15:02:27.100771368 +0000 UTC,XXX_unrecognized:[],} for 353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507" error="failed to handle container TaskExit event: failed to stop container: context deadline exceeded: unknown"
This initial error stems from an (unrelated) transient networking issue. However, gVisor had already received the request and deleted the container. When containerd backs off and retries, it ends up stuck in an error loop:
containerd[1861]: time="2022-05-02T15:02:42.659175219Z" level=error msg="Failed to handle backOff event &TaskExit{ContainerID:353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507,ID:353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507,Pid:5475,ExitStatus:0,ExitedAt:2022-05-02 15:02:27.100771368 +0000 UTC,XXX_unrecognized:[],} for 353c068378b0cba441e7032de11b5bb7989b86142555be54b40f23bbb4d96507" error="failed to handle container TaskExit event: failed to stop container: cannot delete a deleted container/process: not found: unknown"
This second error comes from gVisor's deleted_state.go, which returns a wrapped errdefs.ErrNotFound.
This error is passed back and handled by containerd here. However, it is not recognized as an ErrNotFound (it is reported as "failed to stop container"), so containerd keeps retrying.
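The retry behavior described above can be sketched with the standard library alone. This is a minimal illustration, not containerd's actual handler: errNotFound, fakeShim, and handleExit are hypothetical names, and the sketch assumes the handler would stop retrying once it can recognize a not-found error as "already deleted".

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for errdefs.ErrNotFound.
var errNotFound = errors.New("not found")

// fakeShim is a hypothetical shim: the first stop call times out even though
// the container was deleted, and every later call reports it as gone.
type fakeShim struct{ calls int }

func (s *fakeShim) stop() error {
	s.calls++
	if s.calls == 1 {
		return errors.New("context deadline exceeded")
	}
	return fmt.Errorf("cannot delete a deleted container/process: %w", errNotFound)
}

// handleExit retries stop up to maxAttempts times. A not-found error means
// the container is already gone, so it is treated as success.
func handleExit(s *fakeShim, maxAttempts int) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = s.stop()
		if err == nil || errors.Is(err, errNotFound) {
			return attempt, nil
		}
	}
	return maxAttempts, err
}

func main() {
	attempts, err := handleExit(&fakeShim{}, 5)
	fmt.Println(attempts, err)
}
```

In this sketch the loop ends on the second attempt because the not-found error is recognized; if the check were missing or failed, every retry would return the same error.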
Steps to reproduce
We are encountering this issue with pods running in GKE Sandbox (1.21.10-gke.2000) on cos_containerd nodes.
runsc version
$ /home/containerd/usr/local/sbin/runsc --version
runsc version google-431402566
spec: 1.0.2-dev
docker version (if using docker)
$ crictl version
Version: 0.1.0
RuntimeName: containerd
RuntimeVersion: 1.4.8
RuntimeApiVersion: v1alpha2
$ ctr version
Client:
Version: 1.4.8
Revision: 7eba5930496d9bbe375fdf71603e610ad737d2b2
Go version: go1.13.5
Server:
Version: 1.4.8
Revision: 7eba5930496d9bbe375fdf71603e610ad737d2b2
UUID: a197db6e-c4fe-46c8-8e1c-0e723b8cbd70
uname
Linux gke-cluster-airplane-pool-public-2110-605384a7-482k 5.4.170+ #1 SMP Sat Mar 5 10:08:44 PST 2022 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux
kubectl (if using Kubernetes)
$ kc version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:41:58Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.10-gke.2000", GitCommit:"0823380786b063c3f71d5e7c76826a972e30550d", GitTreeState:"clean", BuildDate:"2022-03-17T09:22:22Z", GoVersion:"go1.16.14b7", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.23) and server (1.21) exceeds the supported minor version skew of +/-1
repo state (if built from source)
No response
runsc debug logs (if available)
No response
I'm not sure how nested Go errors are passed across gRPC, but my guess is that the wrapping gets lost in translation. We could try to massage the error to make containerd happy, or we could not error out in this case. Do you have a way to repro this problem?
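As a rough illustration of how wrapping can get lost in translation, here is a standard-library sketch that does not use gRPC at all. It assumes a simplified wire format in which only a code and a message cross the boundary; wireError, classified, flatten, and roundTrip are hypothetical names, and errNotFound stands in for errdefs.ErrNotFound.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for errdefs.ErrNotFound.
var errNotFound = errors.New("not found")

// wireError is a hypothetical wire format: only a code and a message survive.
type wireError struct {
	code    string
	message string
}

// classified is the error rebuilt by the receiver; Unwrap exposes its class.
type classified struct {
	msg   string
	class error
}

func (c classified) Error() string { return c.msg }
func (c classified) Unwrap() error { return c.class }

// deletedStateErr builds the wrapped error quoted in this issue.
func deletedStateErr() error {
	return fmt.Errorf("cannot delete a deleted container/process: %w", errNotFound)
}

// flatten sends only the message, so the receiver cannot recover the class.
func flatten(err error) error {
	return errors.New(err.Error())
}

// roundTrip classifies on the sending side and rebuilds on the receiving side.
func roundTrip(err error) error {
	w := wireError{code: "unknown", message: err.Error()}
	if errors.Is(err, errNotFound) {
		w.code = "not_found"
	}
	if w.code == "not_found" {
		return classified{msg: w.message, class: errNotFound}
	}
	return errors.New(w.message)
}

// survives reports whether the receiver can still see the not-found class.
func survives(err error) bool { return errors.Is(err, errNotFound) }

func main() {
	fmt.Println(survives(flatten(deletedStateErr())))   // message only
	fmt.Println(survives(roundTrip(deletedStateErr()))) // code plus message
}
```

The point of the sketch is that the class has to be determined before the error is flattened; once only the message text remains, the receiver has nothing to match against.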
@fvoznika thanks for taking a look! My sense is that the easiest approach here (and the one that changes the API behavior the least) is to just replace the error in deleted_state.go with a utils.ErrToGRPCf(errdefs.ErrNotFound, ...). It looks like a bunch of other errors were converted to this form in an earlier change.
Unfortunately, we don't have an easy way to reproduce this problem at the moment. As part of running our product, we start and terminate several thousand containers per hour, and this error happens only once or twice an hour. However, when it happens, it causes a bunch of downstream problems that we need to manually clean up, so we're eager to get it fixed.
Thanks, and let us know if there's any other information we can provide.
@fvoznika thanks for making that fix. I was wondering: why did you get rid of the code in pkg/shim/utils/errors.go? Because gVisor is still pinned to containerd v1.3.9, the errdefs.ToGRPC function being called uses errors.Cause instead of errors.Is to classify the error, so I think the utility code actually helped avoid some translation problems.
Here's some code I used to test:
package main

import (
	"errors"
	"fmt"

	// Using containerd pinned at v1.3.9 (same version as used by gvisor)
	"github.com/containerd/containerd/errdefs"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	// Error returned from 'func (*deletedState) Delete(context.Context)'
	origErr := fmt.Errorf("cannot delete a deleted container/process: %w", errdefs.ErrNotFound)

	// Using current logic
	translatedErr := errdefs.ToGRPC(origErr)
	containerdErr := errdefs.FromGRPC(translatedErr)
	fmt.Println(
		"Error as seen by containerd:",
		containerdErr,
		errors.Is(containerdErr, errdefs.ErrNotFound),
	)

	// What I think we want instead (i.e., what was done in the old utility function)?
	translatedErr = status.Errorf(codes.NotFound, origErr.Error())
	containerdErr = errdefs.FromGRPC(translatedErr)
	fmt.Println(
		"Error as seen by containerd:",
		containerdErr,
		errors.Is(containerdErr, errdefs.ErrNotFound),
	)
}
The outputs are:
Error as seen by containerd: cannot delete a deleted container/process: not found: unknown false
Error as seen by containerd: cannot delete a deleted container/process: not found true
which has me worried that the containerd server (running at v1.5.4) will still treat the deletion error as being "unknown".
Let me know if that makes sense, and apologies if I'm misunderstanding!
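To make the errors.Cause versus errors.Is difference concrete without pulling in containerd, here is a standard-library sketch. The cause function is a hypothetical reimplementation of what errors.Cause from github.com/pkg/errors does as I understand it (it follows only errors exposing a Cause() method), and errNotFound stands in for errdefs.ErrNotFound.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for errdefs.ErrNotFound.
var errNotFound = errors.New("not found")

// cause mimics errors.Cause from github.com/pkg/errors (an assumption about
// its behavior): it follows only errors that expose a Cause() error method.
func cause(err error) error {
	type causer interface{ Cause() error }
	for err != nil {
		c, ok := err.(causer)
		if !ok {
			break
		}
		err = c.Cause()
	}
	return err
}

// isNotFoundByCause classifies by comparing the root cause to the sentinel.
func isNotFoundByCause(err error) bool { return cause(err) == errNotFound }

// isNotFoundByIs classifies by walking the Unwrap chain with errors.Is.
func isNotFoundByIs(err error) bool { return errors.Is(err, errNotFound) }

// deletedStateErr builds the %w-wrapped error quoted in this issue.
func deletedStateErr() error {
	return fmt.Errorf("cannot delete a deleted container/process: %w", errNotFound)
}

func main() {
	wrapped := deletedStateErr()
	fmt.Println("by cause:", isNotFoundByCause(wrapped))
	fmt.Println("by is:   ", isNotFoundByIs(wrapped))
	fmt.Println("bare sentinel by cause:", isNotFoundByCause(errNotFound))
}
```

An error built with fmt.Errorf and %w implements Unwrap but not Cause, so a Cause-style check sees only the outer error and misses the sentinel, while errors.Is finds it.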
We changed gVisor to use containerd 1.4 internally a while back. I started the change to update it externally too, but got distracted with other stuff. I'll push a change to update it. Thanks for checking!
If you patch #7574, you should see the following with your test:
Error as seen by containerd: cannot delete a deleted container/process: not found true
Error as seen by containerd: cannot delete a deleted container/process: not found true
@fvoznika ahh, makes sense now. Thanks!