containernetworking / cni

Container Network Interface - networking for Linux containers

Home Page:https://cni.dev

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Failed to destroy the network because of empty cached cni result.

sofat1989 opened this issue · comments

failed to "KillPodSandbox" for "1b77faa7-5bc3-4b87-a56b-97918872b6f4" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"5bae61e683f7e59902f458514e9af1515d1c1c71ff069e9d4a2008a011307d04\": failed to get network \"generic-veth\" cached result: decoding version from network config: unexpected end of JSON input"


276776240 -rw------- 1 root root    0 Nov 27 23:16 generic-veth-5bae61e683f7e59902f458514e9af1515d1c1c71ff069e9d4a2008a011307d04-eth1
276779765 -rw------- 1 root root    0 Nov 27 23:16 istio-cni-5bae61e683f7e59902f458514e9af1515d1c1c71ff069e9d4a2008a011307d04-eth2
reboot   system boot  5.15.0-26-generi Mon Nov 27 23:20 - 00:10 (30+00:49)
reboot   system boot  5.15.0-26-generi Mon Nov 27 23:16 - 00:10 (30+00:53)
reboot   system boot  5.15.0-26-generi Mon Nov 27 23:13 - 00:10 (30+00:56)

Recently we met an issue. When the containerd is adding the networks for pods, but at that time, the system is rebooted. It will leave some empty files under /var/lib/cni/results. And after the system is up again, but the pods are deleted by the users. containerd will try to read the cached result to destroy the networks. It will fail directly.

For the cached file, the data will be put in the prevResult field. It seems that a lot of plugins only care about the previous result of the other plugins chained together. And it doesn't care about the previous result of itself.

Can we delay reporting the error of cached result and try to delete the network with cachedResult=nil ?

// DelNetwork executes the plugin with the DEL command
 func (c *CNIConfig) DelNetwork(ctx context.Context, net *NetworkConfig, rt *RuntimeConf) error {
        var cachedResult types.Result
+       var cachedResultErr error

        // Cached result on DEL was added in CNI spec version 0.4.0 and higher
        if gtet, err := version.GreaterThanOrEqualTo(net.Network.CNIVersion, "0.4.0"); err != nil {
@@ -567,12 +571,14 @@ func (c *CNIConfig) DelNetwork(ctx context.Context, net *NetworkConfig, rt *Runt
        } else if gtet {
                cachedResult, err = c.getCachedResult(net.Network.Name, net.Network.CNIVersion, rt)
                if err != nil {
-                       return fmt.Errorf("failed to get network %q cached result: %w", net.Network.Name, err)
+                       cachedResultErr = fmt.Errorf("failed to get network %q cached result: %w", net.Network.Name, err)
                }
        }
-
        if err := c.delNetwork(ctx, net.Network.Name, net.Network.CNIVersion, net, cachedResult, rt); err != nil {
-               return err
+               if cachedResultErr != nil {
+                       return fmt.Errorf("plugin %s failed (delete): %w; %w", pluginDescription(net.Network), cachedResultErr, err)
+               }
+               return fmt.Errorf("plugin %s failed (delete): %w", pluginDescription(net.Network), err)
        }
        _ = c.cacheDel(net.Network.Name, rt)
        return nil

Oops! That's not good!

Delete needs to be extremely robust to failure. We clearly need to fix this.

I'll be happy to contribute a fix. IMHO there are a few options, some of them were mentioned in #1066 discussions too.

  1. Introduce an optional syncfs syscall to flush the FS on every cache file writes. But it would potentially degrade the overall system performance;
  2. Take the proposal in this issue description, try to delete the network with empty cachedResult as the best effort.
  3. I'm open to other suggestions too.

Thanks!

Discussed during community meeting. The decision was 2nd option, "try to delete the network with empty cachedResult as the best effort". PR #1072.