`clickhouse-backup server --watch` stops when initial connection fails
itssimon opened this issue · comments
I'm running clickhouse-backup as a sidecar container with clickhouse-server on Kubernetes (using Altinity ClickHouse Operator). When the pod with both containers starts, about half of the time clickhouse-server is not ready to accept connections from clickhouse-backup straight away, and that appears to freeze clickhouse-backup.
Often it works eventually after I restart the pod a couple of times, so this looks like a timing issue.
See log output of the clickhouse-backup container below for when it fails. No backups are created and there is no further log output, even after hours.
2024/03/02 02:00:57.407705 info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:00:57.411904 error clickhouse connection ping: tcp://localhost:9000 return error: dial tcp [::1]:9000: connect: connection refused logger=clickhouse
2024/03/02 02:00:57.411951 error dial tcp [::1]:9000: connect: connection refused logger=server.Run
2024/03/02 02:01:02.437006 info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:01:02.463061 info clickhouse connection open: tcp://localhost:9000 logger=clickhouse
2024/03/02 02:01:02.463366 info Create integration tables logger=server
2024/03/02 02:01:02.463585 info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:01:02.520820 error clickhouse connection ping: tcp://localhost:9000 return error: read: EOF logger=clickhouse
2024/03/02 02:01:02.520878 error can't connect to clickhouse: read: EOF logger=server.Run
2024/03/02 02:01:02.552043 info Starting API server on 0.0.0.0:7171 logger=server.Run
2024/03/02 02:01:02.588894 info Starting API Server in watch mode logger=server
2024/03/02 02:01:02.589005 info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:01:02.589438 info Update backup metrics start (onlyLocal=false) logger=server
2024/03/02 02:01:02.590896 info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:01:02.593702 error clickhouse connection ping: tcp://localhost:9000 return error: dial tcp [::1]:9000: connect: connection refused logger=clickhouse
2024/03/02 02:01:02.594849 error UpdateBackupMetrics return error: dial tcp [::1]:9000: connect: connection refused logger=server.Run
2024/03/02 02:01:02.595358 error clickhouse connection ping: tcp://localhost:9000 return error: read: EOF logger=clickhouse
2024/03/02 02:01:02.595640 error ResumeOperationsAfterRestart return error: read: EOF logger=server.Run
2024/03/02 02:01:02.596038 info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:01:02.601335 error clickhouse connection ping: tcp://localhost:9000 return error: dial tcp [::1]:9000: connect: connection refused logger=clickhouse
Below is my pod template:
podTemplates:
  - name: default
    spec:
      containers:
        - name: clickhouse-server
          image: clickhouse/clickhouse-server:23.9
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: bootstrap-scripts
              mountPath: /docker-entrypoint-initdb.d
          ports:
            - name: metrics
              containerPort: 9363
        - name: clickhouse-backup
          image: altinity/clickhouse-backup:2.4.33
          command: ["/bin/clickhouse-backup", "server", "--watch"]
          env:
            - name: ALLOW_EMPTY_BACKUPS
              value: "true"
            - name: API_LISTEN
              value: "0.0.0.0:7171"
            - name: API_CREATE_INTEGRATION_TABLES
              value: "true"
            - name: BACKUPS_TO_KEEP_LOCAL
              value: "3"
            - name: BACKUPS_TO_KEEP_REMOTE
              value: "336"
            - name: REMOTE_STORAGE
              value: s3
            - name: S3_ENDPOINT
              value: https://sfo3.digitaloceanspaces.com
            - name: S3_BUCKET
              value: xxx
            - name: S3_PATH
              value: cluster-{cluster}/shard-{shard}
            - name: S3_ACCESS_KEY
              value: xxx
            - name: S3_SECRET_KEY
              value: {{ "ref+sops://secrets.yaml#/clickhouseBackupSpacesSecretKey" | fetchSecretValue | quote }}
            - name: S3_FORCE_PATH_STYLE
              value: "true"
            - name: WATCH_INTERVAL
              value: "1h"
            - name: FULL_INTERVAL
              value: "24h"
            - name: WATCH_BACKUP_NAME_TEMPLATE
              value: "{time:20060102150405}-{type}"
          ports:
            - name: backup-api
              containerPort: 7171
      volumes:
        - name: bootstrap-scripts
          configMap:
            name: bootstrap-scripts
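If this really is a startup race, one possible stopgap is to delay the sidecar until port 9000 stays reachable. The fragment below is only a sketch of that idea, not something confirmed by the maintainers. It assumes `bash` is available in the `altinity/clickhouse-backup` image and that its `/dev/tcp` redirection is enabled. The probe count (5) and interval (2s) are arbitrary illustrative values. It asks for several consecutive successful probes rather than one, because in the log above the connection opens at 02:01:02.463 and is refused again at 02:01:02.593.

```yaml
# Replacement for the clickhouse-backup container's `command` only;
# the env, ports and the rest of the pod template stay as they are.
- name: clickhouse-backup
  image: altinity/clickhouse-backup:2.4.33
  command:
    - bash
    - -c
    - |
      # Assumption: bash with /dev/tcp support exists in this image.
      # Count consecutive successful TCP connects to localhost:9000 and
      # reset the counter on any failure.
      ok=0
      while [ "$ok" -lt 5 ]; do
        if (exec 3<>/dev/tcp/localhost/9000) 2>/dev/null; then
          ok=$((ok + 1))
        else
          ok=0
        fi
        sleep 2
      done
      # Replace the shell with the original sidecar command.
      exec /bin/clickhouse-backup server --watch
```

This only narrows the window. If clickhouse-server restarts later, the sidecar may still hit the same condition.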
Thanks for reporting, will fix in the next release.
As a workaround, I would like to propose using a CronJob
instead of watch; see the examples:
https://github.com/Altinity/clickhouse-backup/blob/master/Examples.md#how-to-use-clickhouse-backup-in-kubernetes
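For orientation, a CronJob-based trigger could look roughly like the sketch below. It is not copied from the linked examples and has not been tested against 2.4.33. It assumes the sidecar runs `clickhouse-backup server` without `--watch`, that its REST API on port 7171 is reachable from the job, and that `POST /backup/create` is the endpoint to call. The resource name, the `BACKUP_API` variable and its host value are hypothetical placeholders. Upload to remote storage and the full/incremental rotation that watch mode handles are not covered here; the linked examples are the reference for those.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: clickhouse-backup-cron        # hypothetical name
spec:
  schedule: "0 * * * *"               # hourly, mirroring WATCH_INTERVAL=1h
  concurrencyPolicy: Forbid           # do not start a new run while one is active
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger-backup
              image: curlimages/curl  # assumption: any image that ships curl will do
              env:
                - name: BACKUP_API    # hypothetical: host:port of the sidecar API
                  value: "chi-example-host:7171"
              command:
                - sh
                - -c
                - |
                  # Ask the sidecar to create a local backup. The API only
                  # acknowledges the request; the outcome has to be checked
                  # separately, e.g. via GET /backup/status.
                  curl -sf -X POST "http://${BACKUP_API}/backup/create"
```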