cluster-init job seems to fail due to starting too early
umegaya opened this issue · comments
hi, thank you for creating the helm template for cockroachdb, it helps a lot with running the db on k8s.
I've been using this chart for years without pinning the version. After the recent change around certificate creation, the job my-db-cockroach-init
fails like the following:
++ /cockroach/cockroach init --certs-dir=/cockroach-certs/ --host=my-db-cockroachdb-0.my-db-cockroachdb:26257
E211019 00:28:06.335840 1 cli/error.go:229 server closed the connection.
Is this a CockroachDB node?
initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup my-db-cockroachdb-0.my-db-cockroachdb on 172.20.0.10:53: no such host"
Error: server closed the connection.
Is this a CockroachDB node?
initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup my-db-cockroachdb-0.my-db-cockroachdb on 172.20.0.10:53: no such host"
Failed running "init"
If we restart the my-db-cockroach-init job a few times, the error disappears and cockroachdb can bootstrap normally, so we guess this is caused by the cluster-init job starting too early.
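If that guess is right, the failure window lasts only until the cluster DNS publishes the pod's record. A pre-check like the sketch below (a hypothetical helper, using the hostname from our deployment) would make the race visible and avoidable:

```shell
# Hypothetical pre-check: block until the first node's DNS record exists
# before attempting `cockroach init`. getent goes through the same
# resolver path as the failing dial in the error above.
wait_for_dns() {
  local host="$1" timeout="${2:-120}" waited=0
  until getent hosts "$host" > /dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $host to resolve" >&2
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
}

# Usage (inside the init container, before running the init command):
#   wait_for_dns my-db-cockroachdb-0.my-db-cockroachdb 120 \
#     && /cockroach/cockroach init --certs-dir=/cockroach-certs/ \
#          --host=my-db-cockroachdb-0.my-db-cockroachdb:26257
```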
We can work around the problem by restarting the cluster-init job until it succeeds, like the following:
printf "waiting for cockroach db init to succeed."
while :
do
  # full_pod_name is our helper that resolves the generated pod name;
  # `local` because this runs inside a shell function
  local cockroach_boot_pod
  cockroach_boot_pod=$(full_pod_name "${release_name}-cockroachdb-init")
  local rc=$?
  if [ -n "${cockroach_boot_pod}" ]; then
    local phase=$(kubectl get po -o json "${cockroach_boot_pod}" | jq -r .status.phase)
    printf "%s." "${phase}"
    if [ "${phase}" = "Succeeded" ]; then
      local success=$(kubectl logs "${cockroach_boot_pod}" | grep -E "(Cluster successfully initialized|cluster has already been initialized)")
      if [ -n "${success}" ]; then
        echo "done."
        break
      else
        echo "cockroach boot pod failed, restarting it now."
        kubectl get po -o json "${cockroach_boot_pod}" > /tmp/cockroach_boot_pod.json
        kubectl delete po "${cockroach_boot_pod}"
        kubectl apply -f /tmp/cockroach_boot_pod.json
      fi
    fi
  else
    printf "%s." "${rc}"
  fi
  sleep 5
done
any idea why this happens? and are there any better workarounds?
regards,
Can you attach steps to reproduce this issue please?
I also seem to be hitting this problem; in my case it isn't even attempting to launch the init job at all, using 21.1.15 as the version.
@umegaya The init job already has a retry script built into it. It's normal to see the job reporting errors for up to five minutes (a generous overestimate) after both the job and the first pod of the cluster have started running.
Have you been able to reproduce the issue on more recent versions of the helm chart?
@sfines-clgx would you mind sharing the values file that you're using? If the job isn't even launching, it's most likely a configuration issue.
The init job is not running for me either and I'm using the default values.yaml
file without any modifications. I'm using the 7.0.0 tag for the helm chart.
Hi, I've just spent a couple of days trying to get a local CRDB cluster up and running with FluxV2 and Helm. After a lot of guess-based changes to the values, while other people got it working with manual Helm installations, what did the trick was setting spec.install.disableWait: true.
Without that set, the init job never ran at all.
I think we should remove the hooks and trust the idempotency / retry mechanisms of the script, letting it run alongside the statefulset setup. Disabling the wait flag on Flux shouldn't be necessary for this to work.
WDYT?
It's not working for me either; I think the init job didn't run. I'm tempted to build my own job for it from the template, but that shouldn't be necessary, right?
I'm not sure, but the helm hook makes the job run only after the cluster is healthy, yet the cluster never gets healthy because the job never ran.
Is this how it works? :)
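For context, the chart runs the init Job as a Helm hook, roughly like the sketch below (a hypothetical excerpt; the exact annotation values in the chart may differ). With a post-install hook, Helm's `--wait` (and Flux's default wait) consider the release done only once the StatefulSet is ready, while the StatefulSet can't become ready until this Job initializes the cluster, hence the apparent deadlock unless the wait is disabled:

```yaml
# Hypothetical excerpt of the chart's init Job template. The hook
# annotations defer the Job until Helm thinks the install completed,
# which a blocking wait never reaches, because the StatefulSet's
# readiness depends on this very Job running.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-db-cockroachdb-init
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
```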
Replace the value of the env variable with your cluster name and kubectl apply it. It's not the cleanest workaround, but it got my cluster bootstrapped, so I'm fine with that (for now). A "native" fix would be nice though. :)
---
kind: Job
apiVersion: batch/v1
metadata:
  name: cockroachdb-init
  namespace: crdb
spec:
  template:
    spec:
      restartPolicy: OnFailure
      terminationGracePeriodSeconds: 0
      serviceAccountName: cockroachdb
      initContainers:
        - name: copy-certs
          image: busybox
          imagePullPolicy: IfNotPresent
          command:
            - /bin/sh
            - -c
            - "cp -f /certs/* /cockroach-certs/; chmod 0400 /cockroach-certs/*.key"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: client-certs
              mountPath: /cockroach-certs/
            - name: certs-secret
              mountPath: /certs/
      containers:
        - name: cluster-init
          image: "cockroachdb/cockroach:v22.2.1"
          imagePullPolicy: IfNotPresent
          # Run the command in a `while true` loop because this Job is bound
          # to come up before the CockroachDB Pods (due to the time needed to
          # get PersistentVolumes attached to Nodes), and sleeping 5 seconds
          # between attempts is much better than letting the Pod fail when
          # the init command does and waiting out Kubernetes' non-configurable
          # exponential back-off for Pod restarts.
          # The command completes either when cluster initialization succeeds,
          # or when the cluster has already been initialized.
          command:
            - /bin/bash
            - -c
            - |
              initCluster() {
                while true; do
                  # Assign separately from `local` so that $? reflects the
                  # command substitution rather than the `local` builtin.
                  local output
                  output=$(
                    set -x
                    /cockroach/cockroach init \
                      --certs-dir=/cockroach-certs/ \
                      --cluster-name="${CLUSTER_NAME}" \
                      --host=cockroachdb-0.cockroachdb:26257 2>&1)
                  local exitCode=$?
                  echo "$output"
                  if [[ "$exitCode" == "0" || "$output" == *"cluster has already been initialized"* ]]; then
                    break
                  fi
                  sleep 5
                done
              }
              initCluster
          env:
            - name: CLUSTER_NAME
              value: your-cluster
          volumeMounts:
            - name: client-certs
              mountPath: /cockroach-certs/
      volumes:
        - name: client-certs
          emptyDir: {}
        - name: certs-secret
          projected:
            sources:
              - secret:
                  name: cockroachdb-client-secret
                  items:
                    - key: ca.crt
                      path: ca.crt
                      mode: 0400
                    - key: tls.crt
                      path: client.root.crt
                      mode: 0400
                    - key: tls.key
                      path: client.root.key
                      mode: 0400
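The retry logic in the container command can be exercised on its own. Below is a standalone sketch of the same pattern, where `try_init` is a hypothetical stand-in for `cockroach init` (failing twice with a connection error, then reporting the cluster as already initialized):

```shell
#!/usr/bin/env bash
# Standalone sketch of the Job's retry pattern: keep retrying until the
# command exits 0 OR its output shows the work was already done.
# A file-based counter is used because the command substitution below
# runs try_init in a subshell, where variable changes would be lost.
count_file=$(mktemp)
echo 0 > "$count_file"

try_init() {
  local n=$(( $(cat "$count_file") + 1 ))
  echo "$n" > "$count_file"
  if [ "$n" -lt 3 ]; then
    echo "connection refused" >&2
    return 1
  fi
  echo "cluster has already been initialized" >&2
  return 1   # cockroach init also exits non-zero in this case
}

init_cluster() {
  while true; do
    local output
    output=$(try_init 2>&1)   # capture stderr: cockroach logs errors there
    local exit_code=$?
    echo "$output"
    if [[ "$exit_code" == "0" || "$output" == *"cluster has already been initialized"* ]]; then
      break
    fi
    sleep 0.1                 # the real Job sleeps 5 seconds
  done
}

init_cluster
echo "attempts: $(cat "$count_file")"
```

The key detail is matching on the "already been initialized" message: it makes the loop idempotent, so re-running the Job against an initialized cluster terminates instead of retrying forever.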
I was able to reproduce this issue once in a while. I have added a fix for it in the mentioned PR. The change for the fix is here
Fixed in favour of #316
I will test this on my setup when I find the time. Thank you for your work!
Still seeing this issue:
Values file:
clusterDomain: cluster.local
conf:
  attrs: []
  cache: 25%
  cluster-name: ''
  disable-cluster-name-verification: false
  http-port: 8080
  join: []
  locality: ''
  log:
    config: {}
    enabled: false
  logtostderr: INFO
  max-sql-memory: 25%
  path: cockroach-data
  port: 26257
  single-node: false
  sql-audit-dir: ''
  store:
    attrs: null
    enabled: false
    size: null
    type: null
iap:
  enabled: false
image:
  credentials: {}
  pullPolicy: IfNotPresent
  repository: cockroachdb/cockroach
  tag: v23.1.8
ingress:
  annotations: {}
  enabled: false
  hosts: []
  labels: {}
  paths:
    - /
  tls: []
init:
  affinity: {}
  annotations: {}
  jobAnnotations: {}
  labels:
    app.kubernetes.io/component: init
  nodeSelector: {}
  provisioning:
    clusterSettings: null
    databases: []
    enabled: false
    users: []
  resources: {}
  securityContext:
    enabled: true
  tolerations: []
labels: {}
networkPolicy:
  enabled: false
  ingress:
    grpc: []
    http: []
prometheus:
  enabled: true
securityContext:
  enabled: true
service:
  discovery:
    annotations: {}
    labels:
      app.kubernetes.io/component: cockroachdb
  ports:
    grpc:
      external:
        name: grpc
        port: 26257
      internal:
        name: grpc-internal
        port: 26257
    http:
      name: http
      port: 8080
  public:
    annotations: {}
    labels:
      app.kubernetes.io/component: cockroachdb
    type: ClusterIP
serviceMonitor:
  annotations: {}
  enabled: false
  interval: 10s
  labels: {}
  namespaced: false
statefulset:
  annotations: {}
  args: []
  budget:
    maxUnavailable: 1
  customLivenessProbe: {}
  customReadinessProbe: {}
  env: []
  labels:
    app.kubernetes.io/component: cockroachdb
  nodeAffinity: {}
  nodeSelector: {}
  podAffinity: {}
  podAntiAffinity:
    topologyKey: kubernetes.io/hostname
    type: soft
    weight: 100
  podManagementPolicy: Parallel
  priorityClassName: ''
  replicas: 3
  resources: {}
  secretMounts: []
  securityContext:
    enabled: true
  serviceAccount:
    annotations: {}
    create: true
    name: ''
  tolerations: []
  topologySpreadConstraints:
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  updateStrategy:
    type: RollingUpdate
storage:
  hostPath: ''
  persistentVolume:
    annotations: {}
    enabled: true
    labels: {}
    size: 20Gi
    storageClass: do-block-storage-retain
tls:
  certs:
    certManager: false
    certManagerIssuer:
      clientCertDuration: 672h
      clientCertExpiryWindow: 48h
      group: cert-manager.io
      kind: Issuer
      name: cockroachdb
      nodeCertDuration: 8760h
      nodeCertExpiryWindow: 168h
    clientRootSecret: cockroachdb-root
    nodeSecret: cockroachdb-node
    provided: false
    selfSigner:
      caCertDuration: 43800h
      caCertExpiryWindow: 648h
      caProvided: false
      caSecret: ''
      clientCertDuration: 672h
      clientCertExpiryWindow: 48h
      enabled: true
      minimumCertDuration: 624h
      nodeCertDuration: 8760h
      nodeCertExpiryWindow: 168h
      podUpdateTimeout: 2m
      readinessWait: 30s
      rotateCerts: true
      securityContext:
        enabled: true
      svcAccountAnnotations: {}
    tlsSecret: false
    useCertManagerV1CRDs: false
  copyCerts:
    image: busybox
  enabled: false
  selfSigner:
    image:
      credentials: {}
      pullPolicy: IfNotPresent
      registry: gcr.io
      repository: cockroachlabs-helm-charts/cockroach-self-signer-cert
      tag: '1.4'
global:
  cattle:
    systemProjectId: p-sccm5
Error logs:
I230828 15:47:01.812335 154 server/init.go:421 ⋮ [T1,n?] 2066 ‹cockroachdb-0.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:02.810583 154 server/init.go:421 ⋮ [T1,n?] 2067 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:03.810179 154 server/init.go:421 ⋮ [T1,n?] 2068 ‹cockroachdb-0.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:04.809911 154 server/init.go:421 ⋮ [T1,n?] 2069 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:05.811005 154 server/init.go:421 ⋮ [T1,n?] 2070 ‹cockroachdb-0.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:06.810634 154 server/init.go:421 ⋮ [T1,n?] 2071 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:07.810409 154 server/init.go:421 ⋮ [T1,n?] 2072 ‹cockroachdb-0.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:08.810221 154 server/init.go:421 ⋮ [T1,n?] 2073 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry