`clickhouse-backup server --watch` stops when initial connection fails

Question

itssimon opened this issue 3 months ago

I'm running clickhouse-backup as a sidecar container alongside clickhouse-server on Kubernetes (using the Altinity ClickHouse Operator). When the pod with both containers starts, about half the time clickhouse-server is not yet ready to accept connections from clickhouse-backup, and this appears to freeze clickhouse-backup.

Often, after I restart the pod a couple of times, it eventually works, so this looks like a timing issue.

See log output of the clickhouse-backup container below for when it fails. No backups are created and there is no further log output, even after hours.

2024/03/02 02:00:57.407705  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:00:57.411904 error clickhouse connection ping: tcp://localhost:9000 return error: dial tcp [::1]:9000: connect: connection refused logger=clickhouse
2024/03/02 02:00:57.411951 error dial tcp [::1]:9000: connect: connection refused logger=server.Run
2024/03/02 02:01:02.437006  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:01:02.463061  info clickhouse connection open: tcp://localhost:9000 logger=clickhouse
2024/03/02 02:01:02.463366  info Create integration tables logger=server
2024/03/02 02:01:02.463585  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:01:02.520820 error clickhouse connection ping: tcp://localhost:9000 return error: read: EOF logger=clickhouse
2024/03/02 02:01:02.520878 error can't connect to clickhouse: read: EOF logger=server.Run
2024/03/02 02:01:02.552043  info Starting API server on 0.0.0.0:7171 logger=server.Run
2024/03/02 02:01:02.588894  info Starting API Server in watch mode logger=server
2024/03/02 02:01:02.589005  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:01:02.589438  info Update backup metrics start (onlyLocal=false) logger=server
2024/03/02 02:01:02.590896  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:01:02.593702 error clickhouse connection ping: tcp://localhost:9000 return error: dial tcp [::1]:9000: connect: connection refused logger=clickhouse
2024/03/02 02:01:02.594849 error UpdateBackupMetrics return error: dial tcp [::1]:9000: connect: connection refused logger=server.Run
2024/03/02 02:01:02.595358 error clickhouse connection ping: tcp://localhost:9000 return error: read: EOF logger=clickhouse
2024/03/02 02:01:02.595640 error ResumeOperationsAfterRestart return error: read: EOF logger=server.Run
2024/03/02 02:01:02.596038  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/03/02 02:01:02.601335 error clickhouse connection ping: tcp://localhost:9000 return error: dial tcp [::1]:9000: connect: connection refused logger=clickhouse

Below is my pod template:

        podTemplates:
          - name: default
            spec:
              containers:
                - name: clickhouse-server
                  image: clickhouse/clickhouse-server:23.9
                  resources:
                    requests:
                      cpu: 500m
                      memory: 1Gi
                  volumeMounts:
                    - name: bootstrap-scripts
                      mountPath: /docker-entrypoint-initdb.d
                  ports:
                    - name: metrics
                      containerPort: 9363
                - name: clickhouse-backup
                  image: altinity/clickhouse-backup:2.4.33
                  command: ["/bin/clickhouse-backup", "server", "--watch"]
                  env:
                    - name: ALLOW_EMPTY_BACKUPS
                      value: "true"
                    - name: API_LISTEN
                      value: "0.0.0.0:7171"
                    - name: API_CREATE_INTEGRATION_TABLES
                      value: "true"
                    - name: BACKUPS_TO_KEEP_LOCAL
                      value: "3"
                    - name: BACKUPS_TO_KEEP_REMOTE
                      value: "336"
                    - name: REMOTE_STORAGE
                      value: s3
                    - name: S3_ENDPOINT
                      value: https://sfo3.digitaloceanspaces.com
                    - name: S3_BUCKET
                      value: xxx
                    - name: S3_PATH
                      value: cluster-{cluster}/shard-{shard}
                    - name: S3_ACCESS_KEY
                      value: xxx
                    - name: S3_SECRET_KEY
                      value: {{ "ref+sops://secrets.yaml#/clickhouseBackupSpacesSecretKey" | fetchSecretValue | quote }}
                    - name: S3_FORCE_PATH_STYLE
                      value: "true"
                    - name: WATCH_INTERVAL
                      value: "1h"
                    - name: FULL_INTERVAL
                      value: "24h"
                    - name: WATCH_BACKUP_NAME_TEMPLATE
                      value: "{time:20060102150405}-{type}"
                  ports:
                    - name: backup-api
                      containerPort: 7171
              volumes:
                - name: bootstrap-scripts
                  configMap:
                    name: bootstrap-scripts

Eugene Klimov · Answer 1 · Sat Mar 02 2024 11:43:57 GMT+0800 (China Standard Time)

Thanks for reporting. This will be fixed in the next release.

Eugene Klimov · Answer 2 · Sat Mar 02 2024 11:45:27 GMT+0800 (China Standard Time)

As a workaround, I would like to propose using a CronJob instead of watch mode; see the examples:
https://github.com/Altinity/clickhouse-backup/blob/master/Examples.md#how-to-use-clickhouse-backup-in-kubernetes
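
For orientation, a CronJob-based setup could look roughly like the sketch below. This is not copied from the linked examples. It assumes the sidecar keeps running `clickhouse-backup server` (without `--watch`) and that its REST API on port 7171, as set by `API_LISTEN` in the pod template above, is reachable from the job. The CronJob name, the curl image, and the host `clickhouse-backup-api.example.svc` are hypothetical placeholders.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: clickhouse-backup-trigger   # hypothetical name
spec:
  schedule: "0 * * * *"             # hourly, mirroring WATCH_INTERVAL=1h
  concurrencyPolicy: Forbid         # do not start a run while one is active
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger-backup
              image: curlimages/curl   # assumption: any image that provides curl
              # Ask the sidecar's API to create a local backup.
              # The host below is a placeholder for a Service or pod address.
              command:
                - curl
                - --fail
                - -sS
                - -X
                - POST
                - http://clickhouse-backup-api.example.svc:7171/backup/create
```

`POST /backup/create` only creates a local backup, and the API runs it asynchronously. Uploading to remote storage and distinguishing full from incremental backups, which watch mode handles itself, would need additional requests, so the linked examples remain the better starting point.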