cockroachdb / cockroach-operator

k8s operator for CRDB

Cannot decommission node

sabuhigr opened this issue

When decommissioning a node through the operator, the operator logs the following errors:

{"level":"warn","ts":1665138574.1910405,"logger":"controller.CrdbCluster","msg":"scaling down stateful set","CrdbCluster":"cockroach-cluster-stage/cockroachdb","ReconcileId":"kF7Vns39vPGnqXncUhmWnX","have":5,"want":4}
{"level":"error","ts":1665138574.8271742,"logger":"controller.CrdbCluster","msg":"decommission failed","CrdbCluster":"cockroach-cluster-stage/cockroachdb","ReconcileId":"kF7Vns39vPGnqXncUhmWnX","error":"failed to stream execution results back: command terminated with exit code 1","errorVerbose":"failed to stream execution results back: command terminated with exit code 1\n(1) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.CockroachExecutor.Exec\n  | \tpkg/scale/executor.go:57\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).findNodeID\n  | \tpkg/scale/drainer.go:242\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).Decommission\n  | \tpkg/scale/drainer.go:79\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*Scaler).EnsureScale\n  | \tpkg/scale/scale.go:91\n  | github.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n  | \tpkg/actor/decommission.go:143\n  | github.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n  | \tpkg/controller/cluster_controller.go:153\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil\n  | 
\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.UntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\n  | runtime.goexit\n  | \tsrc/runtime/asm_amd64.s:1581\nWraps: (2) failed to stream execution results back\nWraps: (3) command terminated with exit code 1\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) exec.CodeExitError","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n\tpkg/actor/decommission.go:145\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:153\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUnt
ilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1665138574.8283174,"logger":"controller.CrdbCluster","msg":"Error on action","CrdbCluster":"cockroach-cluster-stage/cockroachdb","ReconcileId":"kF7Vns39vPGnqXncUhmWnX","Action":"Decommission","err":"failed to stream execution results back: command terminated with exit code 1"}
{"level":"error","ts":1665138574.8283627,"logger":"controller.CrdbCluster","msg":"action failed","CrdbCluster":"cockroach-cluster-stage/cockroachdb","ReconcileId":"kF7Vns39vPGnqXncUhmWnX","error":"failed to stream execution results back: command terminated with exit code 1","errorVerbose":"failed to stream execution results back: command terminated with exit code 1\n(1) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.CockroachExecutor.Exec\n  | \tpkg/scale/executor.go:57\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).findNodeID\n  | \tpkg/scale/drainer.go:242\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).Decommission\n  | \tpkg/scale/drainer.go:79\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*Scaler).EnsureScale\n  | \tpkg/scale/scale.go:91\n  | github.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n  | \tpkg/actor/decommission.go:143\n  | github.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n  | \tpkg/controller/cluster_controller.go:153\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil\n  | 
\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.UntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\n  | runtime.goexit\n  | \tsrc/runtime/asm_amd64.s:1581\nWraps: (2) failed to stream execution results back\nWraps: (3) command terminated with exit code 1\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) exec.CodeExitError","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:185\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/w
ait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"error","ts":1665138574.836441,"logger":"controller-runtime.manager.controller.crdbcluster","msg":"Reconciler error","reconciler group":"crdb.cockroachlabs.com","reconciler kind":"CrdbCluster","name":"cockroachdb","namespace":"cockroach-cluster-stage","error":"failed to stream execution results back: command terminated with exit code 1","errorVerbose":"failed to stream execution results back: command terminated with exit code 1\n(1) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.CockroachExecutor.Exec\n  | \tpkg/scale/executor.go:57\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).findNodeID\n  | \tpkg/scale/drainer.go:242\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).Decommission\n  | \tpkg/scale/drainer.go:79\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*Scaler).EnsureScale\n  | \tpkg/scale/scale.go:91\n  | github.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n  | \tpkg/actor/decommission.go:143\n  | github.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n  | \tpkg/controller/cluster_controller.go:153\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\n  
| k8s.io/apimachinery/pkg/util/wait.BackoffUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.UntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\n  | runtime.goexit\n  | \tsrc/runtime/asm_amd64.s:1581\nWraps: (2) failed to stream execution results back\nWraps: (3) command terminated with exit code 1\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) exec.CodeExitError","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:301\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1665138584.3979504,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"cockroach-cluster-stage/cockroachdb","ReconcileId":"URAajRJhYWotEB4tQs6hRm"}
{"level":"info","ts":1665138584.3980412,"logger":"webhooks","msg":"default","name":"cockroachdb"}
{"level":"info","ts":1665138584.4027824,"logger":"controller.CrdbCluster","msg":"Running action with name: Decommission","CrdbCluster":"cockroach-cluster-stage/cockroachdb","ReconcileId":"URAajRJhYWotEB4tQs6hRm"}
{"level":"warn","ts":1665138584.4028075,"logger":"controller.CrdbCluster","msg":"check decommission opportunities","CrdbCluster":"cockroach-cluster-stage/cockroachdb","ReconcileId":"URAajRJhYWotEB4tQs6hRm"}
{"level":"info","ts":1665138584.4028518,"logger":"controller.CrdbCluster","msg":"replicas decommissioning","CrdbCluster":"cockroach-cluster-stage/cockroachdb","ReconcileId":"URAajRJhYWotEB4tQs6hRm","status.CurrentReplicas":5,"expected":4}
{"level":"warn","ts":1665138584.4028952,"logger":"controller.CrdbCluster","msg":"operator is running inside of kubernetes, connecting to service for db connection","CrdbCluster":"cockroach-cluster-stage/cockroachdb","ReconcileId":"URAajRJhYWotEB4tQs6hRm"}
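
The log above repeats the same failure at several levels of the call stack, so it can help to collapse it to the distinct error messages. Below is a minimal sketch, assuming the operator log is available as one JSON object per line with the `level`, `msg`, `error`, and `ReconcileId` fields seen above. The function name `summarize_errors` is hypothetical and not part of the operator.

```python
import json


def summarize_errors(lines):
    """Return distinct (ReconcileId, msg, error) tuples from error-level log lines.

    Lines that are not valid JSON (for example wrapped or truncated
    lines) are skipped rather than raising.
    """
    seen = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # wrapped or truncated line, ignore it
        if entry.get("level") != "error":
            continue
        # ReconcileId is absent on controller-runtime entries, so it may be None.
        key = (entry.get("ReconcileId"), entry.get("msg"), entry.get("error"))
        if key not in seen:
            seen.append(key)
    return seen
```

Fed the log above, this should reduce the three long error entries to their short `msg` and `error` pairs, all of which carry the same underlying error: `failed to stream execution results back: command terminated with exit code 1`.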

When I try to decommission with the cockroach client, it also fails with an error:

$ kubectl -n cockroach-cluster-stage exec -it cockroachdb-client-secure -- ./cockroach node decommission 4 --certs-dir=/cockroach/cockroach-certs --host=cockroachdb-public.cockroach-cluster-stage
ERROR: operation timed out.

failed to connect to the node: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6.001s (given timeout 6s): rpc error: code = DeadlineExceeded desc = context deadline exceeded
Failed running "node decommission"
command terminated with exit code 1

In reality, though, I can connect to every node of the Cockroach cluster inside K8s, and all of them are in a live state.

$ kubectl -n cockroach-cluster-stage exec -it cockroachdb-client-secure -- ./cockroach node status --decommission --certs-dir=/cockroach/cockroach-certs --host=cockroachdb-public.cockroach-cluster-stage
  id |                         address                         |                       sql_address                       |  build  |         started_at         |         updated_at         | locality | is_available | is_live | gossiped_replicas | is_decommissioning | membership | is_draining
-----+---------------------------------------------------------+---------------------------------------------------------+---------+----------------------------+----------------------------+----------+--------------+---------+-------------------+--------------------+------------+--------------
   1 | cockroachdb-0.cockroachdb.cockroach-cluster-stage:26258 | cockroachdb-0.cockroachdb.cockroach-cluster-stage:26257 | v22.1.2 | 2022-10-07 11:21:29.236929 | 2022-10-07 11:47:48.769312 |          | true         | true    |                33 | false              | active     | false
   2 | cockroachdb-1.cockroachdb.cockroach-cluster-stage:26258 | cockroachdb-1.cockroachdb.cockroach-cluster-stage:26257 | v22.1.2 | 2022-10-07 11:21:29.497315 | 2022-10-07 11:47:49.009594 |          | true         | true    |                34 | false              | active     | false
   3 | cockroachdb-2.cockroachdb.cockroach-cluster-stage:26258 | cockroachdb-2.cockroachdb.cockroach-cluster-stage:26257 | v22.1.2 | 2022-10-07 11:21:29.618476 | 2022-10-07 11:47:49.135673 |          | true         | true    |                33 | false              | active     | false
   4 | cockroachdb-3.cockroachdb.cockroach-cluster-stage:26258 | cockroachdb-3.cockroachdb.cockroach-cluster-stage:26257 | v22.1.2 | 2022-10-07 11:31:22.817727 | 2022-10-07 11:47:48.334172 |          | true         | true    |                32 | false              | active     | false
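
To check a table like this without reading every column by eye, the output can be parsed and filtered. This is only a sketch, assuming the pipe-delimited layout shown above with a header row containing `id`, `is_live`, and `membership`. The helper names `parse_node_status` and `not_healthy` are hypothetical.

```python
def parse_node_status(text):
    """Parse pipe-delimited `node status` output into a list of dicts.

    The dashed separator line contains no '|' and is dropped by the filter.
    """
    rows = [line for line in text.splitlines() if "|" in line]
    if not rows:
        return []
    header = [cell.strip() for cell in rows[0].split("|")]
    nodes = []
    for row in rows[1:]:
        cells = [cell.strip() for cell in row.split("|")]
        if len(cells) != len(header):
            continue  # skip malformed rows
        nodes.append(dict(zip(header, cells)))
    return nodes


def not_healthy(nodes):
    """Return the ids of nodes that are not live or not active members."""
    return [
        n["id"]
        for n in nodes
        if n.get("is_live") != "true" or n.get("membership") != "active"
    ]
```

For the output above, `not_healthy` would return an empty list, since all four listed nodes report `is_live` as `true` and `membership` as `active`.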

I installed the latest operator and this issue affects me as well.
It looks like the fix has already been merged to master.
When will there be a release?

Hey there! I believe the fix for this was added in v2.9.0 which was released towards the end of last year. I'm going to close this assuming that's correct. Of course if you still run into issues with this, please feel free to re-open this issue.