cluster-init job seems to fail due to starting too early
umegaya opened this issue · comments
hi, thank you for creating the helm template for cockroachdb, it helps a lot with running the db on k8s.
I've been using this chart for years without pinning the version. After the recent change around certificate creation, the job my-db-cockroach-init
fails like the following:
++ /cockroach/cockroach init --certs-dir=/cockroach-certs/ --host=my-db-cockroachdb-0.my-db-cockroachdb:26257
E211019 00:28:06.335840 1 cli/error.go:229 server closed the connection.
Is this a CockroachDB node?
initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup my-db-cockroachdb-0.my-db-cockroachdb on 172.20.0.10:53: no such host"
Error: server closed the connection.
Is this a CockroachDB node?
initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup my-db-cockroachdb-0.my-db-cockroachdb on 172.20.0.10:53: no such host"
Failed running "init"
If we restart the my-db-cockroach-init job a few times, the error disappears and cockroachdb can bootstrap normally, so we guess this is caused by the cluster-init job starting too early.
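If that guess is right, the failure window lasts only until the cluster DNS publishes the pod's record. A pre-check like the sketch below (a hypothetical helper, using the hostname from our deployment) would make the race visible and avoidable:

```shell
# Hypothetical pre-check: block until the first node's DNS record exists
# before attempting `cockroach init`. getent goes through the same
# resolver path as the failing dial in the error above.
wait_for_dns() {
  local host="$1" timeout="${2:-120}" waited=0
  until getent hosts "$host" > /dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $host to resolve" >&2
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
}

# Usage (inside the init container, before running the init command):
#   wait_for_dns my-db-cockroachdb-0.my-db-cockroachdb 120 \
#     && /cockroach/cockroach init --certs-dir=/cockroach-certs/ \
#          --host=my-db-cockroachdb-0.my-db-cockroachdb:26257
```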
We can work around the problem by restarting the cluster-init job until it succeeds, like the following:
printf "waiting for cockroach db init to succeed."
while :
do
  # full_pod_name is our helper that resolves the generated pod name;
  # `local` because this runs inside a shell function
  local cockroach_boot_pod
  cockroach_boot_pod=$(full_pod_name "${release_name}-cockroachdb-init")
  local rc=$?
  if [ -n "${cockroach_boot_pod}" ]; then
    local phase=$(kubectl get po -o json "${cockroach_boot_pod}" | jq -r .status.phase)
    printf "%s." "${phase}"
    if [ "${phase}" = "Succeeded" ]; then
      local success=$(kubectl logs "${cockroach_boot_pod}" | grep -E "(Cluster successfully initialized|cluster has already been initialized)")
      if [ -n "${success}" ]; then
        echo "done."
        break
      else
        echo "cockroach boot pod failed, restarting it now."
        kubectl get po -o json "${cockroach_boot_pod}" > /tmp/cockroach_boot_pod.json
        kubectl delete po "${cockroach_boot_pod}"
        kubectl apply -f /tmp/cockroach_boot_pod.json
      fi
    fi
  else
    printf "%s." "${rc}"
  fi
  sleep 5
done
any idea why this happens? and are there any better workarounds?
regards,
Can you attach steps to reproduce this issue please?
I also seem to be hitting this problem; in my case it isn't even attempting to launch the init job at all, using 21.1.15 as the version.
@umegaya The init job already has a retry script built into it. It's normal to see the job reporting errors for up to five minutes (a generous overestimate) after both the job and the first pod of the cluster have started running.
Have you been able to reproduce the issue on more recent versions of the helm chart?
@sfines-clgx would you mind sharing the values file that you're using? If the job isn't even launching, it's most likely a configuration issue.
The init job is not running for me either and I'm using the default values.yaml
file without any modifications. I'm using the 7.0.0 tag for the helm chart.
Hi, I've just spent a couple of days trying to get a local CRDB cluster up and running with FluxV2 and Helm. After a lot of guess-based changes to the values, while other people got it working with manual Helm installations, what did the trick was setting spec.install.disableWait: true.
Without that set, the init job never ran at all.
I think we should remove the hooks and trust the idempotency / retry mechanisms of the script, letting it run alongside the statefulset setup. Disabling the wait flag on Flux shouldn't be necessary for this to work.
WDYT?
It's not working for me either; I think the init job didn't run. I'm tempted to build my own job for it from the template, but that shouldn't be necessary, right?
I'm not sure, but the helm hook makes the job run only after the cluster is healthy, yet the cluster never gets healthy because the job never ran.
Is this how it works? :)
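For context, the chart runs the init Job as a Helm hook, roughly like the sketch below (a hypothetical excerpt; the exact annotation values in the chart may differ). With a post-install hook, Helm's `--wait` (and Flux's default wait) consider the release done only once the StatefulSet is ready, while the StatefulSet can't become ready until this Job initializes the cluster, hence the apparent deadlock unless the wait is disabled:

```yaml
# Hypothetical excerpt of the chart's init Job template. The hook
# annotations defer the Job until Helm thinks the install completed,
# which a blocking wait never reaches, because the StatefulSet's
# readiness depends on this very Job running.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-db-cockroachdb-init
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
```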
Replace the value of the env variable with your cluster name and kubectl apply it. It's not the cleanest workaround, but it got my cluster bootstrapped, so I'm fine with that (for now). A "native" fix would be nice though. :)
---
kind: Job
apiVersion: batch/v1
metadata:
  name: cockroachdb-init
  namespace: crdb
spec:
  template:
    spec:
      restartPolicy: OnFailure
      terminationGracePeriodSeconds: 0
      serviceAccountName: cockroachdb
      initContainers:
        - name: copy-certs
          image: busybox
          imagePullPolicy: IfNotPresent
          command:
            - /bin/sh
            - -c
            - "cp -f /certs/* /cockroach-certs/; chmod 0400 /cockroach-certs/*.key"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: client-certs
              mountPath: /cockroach-certs/
            - name: certs-secret
              mountPath: /certs/
      containers:
        - name: cluster-init
          image: "cockroachdb/cockroach:v22.2.1"
          imagePullPolicy: IfNotPresent
          # Run the command in a `while true` loop because this Job is bound
          # to come up before the CockroachDB Pods (due to the time needed to
          # get PersistentVolumes attached to Nodes), and sleeping 5 seconds
          # between attempts is much better than letting the Pod fail when
          # the init command does and waiting out Kubernetes' non-configurable
          # exponential back-off for Pod restarts.
          # The command completes either when cluster initialization succeeds,
          # or when the cluster has already been initialized.
          command:
            - /bin/bash
            - -c
            - |
              initCluster() {
                while true; do
                  # Assign separately from `local` so that $? reflects the
                  # command substitution rather than the `local` builtin.
                  local output
                  output=$(
                    set -x
                    /cockroach/cockroach init \
                      --certs-dir=/cockroach-certs/ \
                      --cluster-name="${CLUSTER_NAME}" \
                      --host=cockroachdb-0.cockroachdb:26257 2>&1)
                  local exitCode=$?
                  echo "$output"
                  if [[ "$exitCode" == "0" || "$output" == *"cluster has already been initialized"* ]]; then
                    break
                  fi
                  sleep 5
                done
              }
              initCluster
          env:
            - name: CLUSTER_NAME
              value: your-cluster
          volumeMounts:
            - name: client-certs
              mountPath: /cockroach-certs/
      volumes:
        - name: client-certs
          emptyDir: {}
        - name: certs-secret
          projected:
            sources:
              - secret:
                  name: cockroachdb-client-secret
                  items:
                    - key: ca.crt
                      path: ca.crt
                      mode: 0400
                    - key: tls.crt
                      path: client.root.crt
                      mode: 0400
                    - key: tls.key
                      path: client.root.key
                      mode: 0400
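The retry logic in the container command can be exercised on its own. Below is a standalone sketch of the same pattern, where `try_init` is a hypothetical stand-in for `cockroach init` (failing twice with a connection error, then reporting the cluster as already initialized):

```shell
#!/usr/bin/env bash
# Standalone sketch of the Job's retry pattern: keep retrying until the
# command exits 0 OR its output shows the work was already done.
# A file-based counter is used because the command substitution below
# runs try_init in a subshell, where variable changes would be lost.
count_file=$(mktemp)
echo 0 > "$count_file"

try_init() {
  local n=$(( $(cat "$count_file") + 1 ))
  echo "$n" > "$count_file"
  if [ "$n" -lt 3 ]; then
    echo "connection refused" >&2
    return 1
  fi
  echo "cluster has already been initialized" >&2
  return 1   # cockroach init also exits non-zero in this case
}

init_cluster() {
  while true; do
    local output
    output=$(try_init 2>&1)   # capture stderr: cockroach logs errors there
    local exit_code=$?
    echo "$output"
    if [[ "$exit_code" == "0" || "$output" == *"cluster has already been initialized"* ]]; then
      break
    fi
    sleep 0.1                 # the real Job sleeps 5 seconds
  done
}

init_cluster
echo "attempts: $(cat "$count_file")"
```

The key detail is matching on the "already been initialized" message: it makes the loop idempotent, so re-running the Job against an initialized cluster terminates instead of retrying forever.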
I was able to reproduce this issue once in a while. I have added a fix for it in the mentioned PR. The change for the fix is here
Fixed in favour of #316
I will test this on my setup when I find the time. Thank you for your work!
Still seeing this issue:
Values file:
clusterDomain: cluster.local
conf:
  attrs: []
  cache: 25%
  cluster-name: ''
  disable-cluster-name-verification: false
  http-port: 8080
  join: []
  locality: ''
  log:
    config: {}
    enabled: false
  logtostderr: INFO
  max-sql-memory: 25%
  path: cockroach-data
  port: 26257
  single-node: false
  sql-audit-dir: ''
  store:
    attrs: null
    enabled: false
    size: null
    type: null
iap:
  enabled: false
image:
  credentials: {}
  pullPolicy: IfNotPresent
  repository: cockroachdb/cockroach
  tag: v23.1.8
ingress:
  annotations: {}
  enabled: false
  hosts: []
  labels: {}
  paths:
    - /
  tls: []
init:
  affinity: {}
  annotations: {}
  jobAnnotations: {}
  labels:
    app.kubernetes.io/component: init
  nodeSelector: {}
  provisioning:
    clusterSettings: null
    databases: []
    enabled: false
    users: []
  resources: {}
  securityContext:
    enabled: true
  tolerations: []
labels: {}
networkPolicy:
  enabled: false
  ingress:
    grpc: []
    http: []
prometheus:
  enabled: true
securityContext:
  enabled: true
service:
  discovery:
    annotations: {}
    labels:
      app.kubernetes.io/component: cockroachdb
  ports:
    grpc:
      external:
        name: grpc
        port: 26257
      internal:
        name: grpc-internal
        port: 26257
    http:
      name: http
      port: 8080
  public:
    annotations: {}
    labels:
      app.kubernetes.io/component: cockroachdb
    type: ClusterIP
serviceMonitor:
  annotations: {}
  enabled: false
  interval: 10s
  labels: {}
  namespaced: false
statefulset:
  annotations: {}
  args: []
  budget:
    maxUnavailable: 1
  customLivenessProbe: {}
  customReadinessProbe: {}
  env: []
  labels:
    app.kubernetes.io/component: cockroachdb
  nodeAffinity: {}
  nodeSelector: {}
  podAffinity: {}
  podAntiAffinity:
    topologyKey: kubernetes.io/hostname
    type: soft
    weight: 100
  podManagementPolicy: Parallel
  priorityClassName: ''
  replicas: 3
  resources: {}
  secretMounts: []
  securityContext:
    enabled: true
  serviceAccount:
    annotations: {}
    create: true
    name: ''
  tolerations: []
  topologySpreadConstraints:
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  updateStrategy:
    type: RollingUpdate
storage:
  hostPath: ''
  persistentVolume:
    annotations: {}
    enabled: true
    labels: {}
    size: 20Gi
    storageClass: do-block-storage-retain
tls:
  certs:
    certManager: false
    certManagerIssuer:
      clientCertDuration: 672h
      clientCertExpiryWindow: 48h
      group: cert-manager.io
      kind: Issuer
      name: cockroachdb
      nodeCertDuration: 8760h
      nodeCertExpiryWindow: 168h
    clientRootSecret: cockroachdb-root
    nodeSecret: cockroachdb-node
    provided: false
    selfSigner:
      caCertDuration: 43800h
      caCertExpiryWindow: 648h
      caProvided: false
      caSecret: ''
      clientCertDuration: 672h
      clientCertExpiryWindow: 48h
      enabled: true
      minimumCertDuration: 624h
      nodeCertDuration: 8760h
      nodeCertExpiryWindow: 168h
      podUpdateTimeout: 2m
      readinessWait: 30s
      rotateCerts: true
      securityContext:
        enabled: true
      svcAccountAnnotations: {}
    tlsSecret: false
    useCertManagerV1CRDs: false
  copyCerts:
    image: busybox
  enabled: false
  selfSigner:
    image:
      credentials: {}
      pullPolicy: IfNotPresent
      registry: gcr.io
      repository: cockroachlabs-helm-charts/cockroach-self-signer-cert
      tag: '1.4'
global:
  cattle:
    systemProjectId: p-sccm5
Error logs:
I230828 15:47:01.812335 154 server/init.go:421 ⋮ [T1,n?] 2066 ‹cockroachdb-0.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:02.810583 154 server/init.go:421 ⋮ [T1,n?] 2067 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:03.810179 154 server/init.go:421 ⋮ [T1,n?] 2068 ‹cockroachdb-0.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:04.809911 154 server/init.go:421 ⋮ [T1,n?] 2069 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:05.811005 154 server/init.go:421 ⋮ [T1,n?] 2070 ‹cockroachdb-0.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:06.810634 154 server/init.go:421 ⋮ [T1,n?] 2071 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:07.810409 154 server/init.go:421 ⋮ [T1,n?] 2072 ‹cockroachdb-0.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry
I230828 15:47:08.810221 154 server/init.go:421 ⋮ [T1,n?] 2073 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry