Night CI 2024-01-30: Failed -- after restarting k8s components, timed out waiting for the DaemonSet to become ready

weizhoublue opened this issue 6 months ago · comments

action url: https://github.com/spidernet-io/egressgateway/actions/runs/7716750992

lou-lan · Answer 1 · Wed Jan 31 2024 10:41:43 GMT+0800 (China Standard Time)

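@bzsuni I looked into this case. I suspect the shutdown-node case overlapped with a case that was running create DaemonSet. The check requires every ds Pod to be ready, so it reports a timeout. Suppose there are 3 nodes and one of them is shut down: the ds status is then 2/3 Ready.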

  [FAILED] Unexpected error:
      <*errors.errorString | 0xc0005a1470>: 
      create DaemonSet time out
      {
          s: "create DaemonSet time out",
      }
  occurred
  In [BeforeEach] at: /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:70
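
To make the suspected failure mode concrete, here is a minimal sketch of a strict "all ready" condition. It is an assumption about how the check behaves, not the actual code in test/e2e/common/ds.go. The `dsStatus` type and the `allReady` function are hypothetical; the two field names are borrowed from the Kubernetes `DaemonSetStatus` type so the sketch runs without a cluster.

```go
package main

import "fmt"

// dsStatus is a hypothetical stand-in for the two DaemonSetStatus
// fields that matter here, so the sketch needs no cluster.
type dsStatus struct {
	DesiredNumberScheduled int32 // nodes that should run a Pod
	NumberReady            int32 // nodes whose Pod is ready
}

// allReady is an assumed strict condition: it only passes when every
// scheduled Pod is ready.
func allReady(s dsStatus) bool {
	return s.DesiredNumberScheduled > 0 && s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	healthy := dsStatus{DesiredNumberScheduled: 3, NumberReady: 3}
	oneDown := dsStatus{DesiredNumberScheduled: 3, NumberReady: 2}

	fmt.Println(allReady(healthy)) // true
	fmt.Println(allReady(oneDown)) // false
}
```

With one of three nodes down, a condition like this stays false no matter how long the test waits, so the wait can only end in a timeout.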

bzsuni · Answer 2 · Wed Jan 31 2024 13:43:56 GMT+0800 (China Standard Time)

• [FAILED] [30.964 seconds]
Reliability [Reliability]
/home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:30
  Test the drift of the EIP [BeforeEach]
  /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:43
    restart components [R00007]
    /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:225
      restart kube-controller-manager
      /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:293

  Timeline >>
  > Enter [BeforeEach] Test the drift of the EIP - /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:43 @ 01/30/24 21:01:18.749
  succeeded to create the gateway: egw-1bfce234-0554-4941-a010-f0d9816fc2a1
  v4DefaultEip: 172.18.6.3, v6DefaultEip: fc00:f853:ccd:e793::602
  Automatically polling progress:
    Reliability Test the drift of the EIP restart components restart kube-controller-manager (Spec Runtime: 20.001s)
      /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:293
      In [BeforeEach] (Node Runtime: 20s)
        /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:43

      Spec Goroutine
      goroutine 3191 [sleep]
        time.Sleep(0x1dcd6500)
          /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/time.go:195
        github.com/spidernet-io/egressgateway/test/e2e/common.CreateDaemonSet({0x2973030, 0xc0000501b8}, {0x297ac00, 0xc000145e60}, {0xc0009a4e80, 0x33}, {0xc000046246, 0x2e}, 0x0?)
          /home/runner/work/egressgateway/egressgateway/test/e2e/common/ds.go:71
      > github.com/spidernet-io/egressgateway/test/e2e/reliability_test.glob..func3.1.1()
          /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:69
            | 
            | // daemonSet
            > daemonSet, err = common.CreateDaemonSet(ctx, cli, "ds-reliability-"+uuid.NewString(), config.Image, time.Minute/2)
            | Expect(err).NotTo(HaveOccurred())
            | GinkgoWriter.Printf("succeeded to create DaemonSet: %s\n", daemonSet.Name)
        github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3({0x0, 0x0})
          /home/runner/work/egressgateway/egressgateway/vendor/github.com/onsi/ginkgo/v2/internal/node.go:463
        github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3()
          /home/runner/work/egressgateway/egressgateway/vendor/github.com/onsi/ginkgo/v2/internal/suite.go:889
        github.com/onsi/ginkgo/v2/internal.(*Suite).runNode
          /home/runner/work/egressgateway/egressgateway/vendor/github.com/onsi/ginkgo/v2/internal/suite.go:876
  [FAILED] Unexpected error:
      <*errors.errorString | 0xc0005a1470>: 
      create DaemonSet time out
      {
          s: "create DaemonSet time out",
      }
  occurred
  In [BeforeEach] at: /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:70 @ 01/30/24 21:01:49.046

bzsuni · Answer 3 · Wed Jan 31 2024 13:50:00 GMT+0800 (China Standard Time)

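@bzsuni I looked into this case. I suspect the shutdown-node case overlapped with a case that was running create DaemonSet. The check requires every ds Pod to be ready, so it reports a timeout. Suppose there are 3 nodes and one of them is shut down: the ds status is then 2/3 Ready.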
  [FAILED] Unexpected error:
      <*errors.errorString | 0xc0005a1470>: 
      create DaemonSet time out
      {
          s: "create DaemonSet time out",
      }
  occurred
  In [BeforeEach] at: /home/runner/work/egressgateway/egressgateway/test/e2e/reliability/reliability_test.go:70 

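Serial has been added to the Describe, so all cases under it should run serially.
In addition, AfterEach calls PowerOnNodesUntilClusterReady, which powers on all nodes and waits for all Pods to be ready.
Something is odd here; this needs a closer look.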

weizhoublue · Answer 4 · Thu Feb 22 2024 10:58:28 GMT+0800 (China Standard Time)

In practice, restarting a node does not seem to be a reliable approach.
Is it possible to restart only the components, such as the api-server?

weizhoublue · Answer 5 · Mon Apr 15 2024 14:19:26 GMT+0800 (China Standard Time)

@lou-lan @bzsuni Any update on this?

lou-lan · Answer 6 · Mon Apr 15 2024 14:41:26 GMT+0800 (China Standard Time)

@lou-lan @bzsuni Any update on this?

It hasn't reappeared recently.

lou-lan · Answer 7 · Thu Apr 25 2024 11:22:46 GMT+0800 (China Standard Time)

Create DaemonSet timeout: Kwok mock nodes can be NotReady. When the test creates a DaemonSet, it uses a time limit (for example, 30s) to check whether the Pod on every node is Running within that time.

lou-lan · Answer 8 · Sun Apr 28 2024 15:32:52 GMT+0800 (China Standard Time)

This stage has been fixed; only one issue remains: #1328