spidernet-io / egressgateway

Network egress policy for Kubernetes

Home Page: https://spidernet-io.github.io/egressgateway/


Night CI 2024-01-30: Failed -- after restarting k8s components, it timed out waiting for the DaemonSet to become ready

weizhoublue opened this issue

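@bzsuni I looked at this case and suspect that the shutdown-node case collided with the case that was creating a DaemonSet. The check requires every Pod of the DaemonSet to be Ready, which is why it reports a timeout: assuming 3 nodes with one of them shut down, the DaemonSet status stays at 2/3 Ready.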

  [FAILED] Unexpected error:
      <*errors.errorString | 0xc0005a1470>: 
      create DaemonSet time out
      {
          s: "create DaemonSet time out",
      }
  occurred
  In [BeforeEach] at: /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:70 
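
The 2/3 Ready reasoning above can be illustrated with a small sketch. It is not the project's code: `dsStatus`, `allReady`, and `describe` are hypothetical names, and the struct only mirrors the two counters of a Kubernetes DaemonSet status (`DesiredNumberScheduled`, `NumberReady`) that matter here, on the assumption that a powered-off node still counts toward the desired number.

```go
package main

import "fmt"

// dsStatus is a hypothetical stand-in for the two DaemonSet status
// counters relevant to the check discussed in this thread.
type dsStatus struct {
	DesiredNumberScheduled int32 // nodes that should run the Pod
	NumberReady            int32 // nodes whose Pod is currently Ready
}

// allReady is the strict condition: every desired Pod must be Ready.
func allReady(s dsStatus) bool {
	return s.DesiredNumberScheduled > 0 && s.NumberReady == s.DesiredNumberScheduled
}

// describe renders the status in the "ready/desired" form used in the comment.
func describe(s dsStatus) string {
	return fmt.Sprintf("%d/%d Ready", s.NumberReady, s.DesiredNumberScheduled)
}
```

With `dsStatus{DesiredNumberScheduled: 3, NumberReady: 2}`, `allReady` stays false for as long as the third node is down, so any wait built on this condition can only end by timing out.
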
• [FAILED] [30.964 seconds]
Reliability [Reliability]
/home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:30
  Test the drift of the EIP [BeforeEach]
  /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:43
    restart components [R00007]
    /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:225
      restart kube-controller-manager
      /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:293

  Timeline >>
  > Enter [BeforeEach] Test the drift of the EIP - /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:43 @ 01/30/24 21:01:18.749
  succeeded to create the gateway: egw-1bfce234-0554-4941-a010-f0d9816fc2a1
  v4DefaultEip: 172.18.6.3, v6DefaultEip: fc00:f853:ccd:e793::602
  Automatically polling progress:
    Reliability Test the drift of the EIP restart components restart kube-controller-manager (Spec Runtime: 20.001s)
      /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:293
      In [BeforeEach] (Node Runtime: 20s)
        /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:43

      Spec Goroutine
      goroutine 3191 [sleep]
        time.Sleep(0x1dcd6500)
          /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/time.go:195
        github.com/spidernet-io/egressgateway/test/e2e/common.CreateDaemonSet({0x2973030, 0xc0000501b8}, {0x297ac00, 0xc000145e60}, {0xc0009a4e80, 0x33}, {0xc000046246, 0x2e}, 0x0?)
          /home/runner/work/egressgateway/egressgateway/test/e2e/common/ds.go:71
      > github.com/spidernet-io/egressgateway/test/e2e/reliability_test.glob..func3.1.1()
          /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:69
            | 
            | // daemonSet
            > daemonSet, err = common.CreateDaemonSet(ctx, cli, "ds-reliability-"+uuid.NewString(), config.Image, time.Minute/2)
            | Expect(err).NotTo(HaveOccurred())
            | GinkgoWriter.Printf("succeeded to create DaemonSet: %s\n", daemonSet.Name)
        github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3({0x0, 0x0})
          /home/runner/work/egressgateway/egressgateway/vendor/github.com/onsi/ginkgo/v2/internal/node.go:463
        github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3()
          /home/runner/work/egressgateway/egressgateway/vendor/github.com/onsi/ginkgo/v2/internal/suite.go:889
        github.com/onsi/ginkgo/v2/internal.(*Suite).runNode
          /home/runner/work/egressgateway/egressgateway/vendor/github.com/onsi/ginkgo/v2/internal/suite.go:876
  [FAILED] Unexpected error:
      <*errors.errorString | 0xc0005a1470>: 
      create DaemonSet time out
      {
          s: "create DaemonSet time out",
      }
  occurred
  In [BeforeEach] at: /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:70 @ 01/30/24 21:01:49.046

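Serial was added to the Describe, so all cases under it should run serially.
In addition, AfterEach calls PowerOnNodesUntilClusterReady to power on all nodes and wait until all Pods are ready.
This is a bit strange; it needs another look.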

In practice, restarting a node has proven to be an unreliable approach.
Is it possible to restart only the components, such as the api-server?

@lou-lan @bzsuni any update on this ?

@lou-lan @bzsuni any update on this ?

It hasn't reappeared recently.

Create DaemonSet Timeout: Kwok mock nodes can be NotReady. When the test creates a DaemonSet, it uses a time limit (for example 30s) to check whether the Pod on each node is Running within that time.

This part has been fixed; only #1328 remains.