Etcd backup operator seems to miss schedule if operator pod/container is restarted
elvinasp opened this issue · comments
Environment:
K8s is running within Azure.
We have set up a 3 node etcd cluster and set 3 backups (hourly, daily, weekly) with backup directly to Azure blob storage.
What is observed:
Looking at the backup history in Azure, there are gaps in the backup cycle. These gaps are most visible with the longer backup cycles.
When I looked at the etcd-backup-operator pod logs, there were multiple restart events within the timeframe of the missing backups. If I understood correctly, the restarts were caused by etcd leader election or something similar.
To validate my suspicion, I wrote the following script, which kills the backup operator pod (and, in a later variant, only the container), and scheduled it via cron to run every 10 minutes. I set the backup interval to 20 minutes. As a result, no backup has completed since 04:39 UTC, when I started the experiment. After 6 restarts the pod went into the Error state. I will continue with a less aggressive restart schedule to see whether that has an impact.
Expected result:
Backups happen according to the schedule regardless of container restarts. The schedule timer should not be tied to the container lifetime, since a container may die at any time. Or is this expected behavior due to the way Kubernetes works?
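To illustrate what I mean by a timer that is not tied to container lifetime, here is a minimal sketch. It is not the operator's actual code; the function name `seconds_until_next_backup` is hypothetical. The idea is to derive the next run from a persisted timestamp (such as the Last Success Date in the EtcdBackup status) instead of counting from process start:

```shell
#!/bin/sh
# Hypothetical helper, not part of etcd-operator.
# Prints how many seconds to wait before the next backup, based on the
# persisted last-success time rather than on how long the process has run.
# Args: last success (epoch seconds), interval (seconds), now (epoch seconds).
seconds_until_next_backup() {
  sunb_last=$1
  sunb_interval=$2
  sunb_now=$3
  sunb_due=$(( sunb_last + sunb_interval ))
  if [ "$sunb_due" -le "$sunb_now" ]; then
    # Overdue: a restarted process should back up immediately.
    echo 0
  else
    echo $(( sunb_due - sunb_now ))
  fi
}
```

With this approach a restart in the middle of an interval only shortens the remaining wait; it never resets the full interval.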
Script:
#!/bin/bash
cd /root
# Log a timestamp, then send signal 5 (SIGTRAP) to PID 1 in the operator container.
# Append both stdout and stderr of each command to the log.
date +"%Y %m %d - %H:%M" >> kill-operator.log 2>&1
/usr/local/bin/kubectl -n tep-k8s-test-01 exec -c etcd-backup-operator $(/usr/local/bin/kubectl -n tep-k8s-test-01 get po -l name=etcd-backup-operator -o name) -- /bin/kill -5 1 >> kill-operator.log 2>&1
echo "----" >> kill-operator.log 2>&1
Edited backup schedule:
root@atl-cj1-m-ducx:~# kubectl -n tep-k8s-test-01 describe EtcdBackup etcd-cluster-backup-weekly
Name: etcd-cluster-backup-weekly
Namespace: tep-k8s-test-01
Labels: <none>
Annotations: <none>
API Version: etcd.database.coreos.com/v1beta2
Kind: EtcdBackup
Metadata:
  Creation Timestamp: 2020-01-15T07:54:50Z
  Finalizers:
    backup-operator-periodic
  Generation: 145
  Resource Version: 81580419
  Self Link: /apis/etcd.database.coreos.com/v1beta2/namespaces/tep-k8s-test-01/etcdbackups/etcd-cluster-backup-weekly
  UID: 7dd4c2a7-e1e0-4fe1-ae04-100be7ff6d65
Spec:
  Abs:
    Abs Secret: storage-account-credentials-weekly
    Path: tep-k8s-test-01/etcd.backup
  Backup Policy:
    Backup Interval In Second: 1200
  Etcd Endpoints:
    http://etcd-cluster-client:2379
  Storage Type: ABS
Status:
  Etcd Revision: 1098811
  Etcd Version: 3.4.3
  Last Success Date: 2020-01-27T04:39:09Z
  Succeeded: true
Events: <none>
root@atl-cj1-m-ducx:~# date
Mon Jan 27 09:05:37 UTC 2020
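For reference, the size of the gap can be worked out from the Last Success Date and the `date` output above. This is only a quick arithmetic sketch; the helper name `hms_to_seconds` is made up for illustration, and it assumes both timestamps fall on the same UTC day (true here, both are 2020-01-27):

```shell
#!/bin/sh
# Hypothetical helper: convert HH:MM:SS to seconds since midnight.
hms_to_seconds() {
  # Split on ':'; strip one leading zero so "09" is not parsed as octal.
  IFS=: read -r hts_h hts_m hts_s <<EOF
$1
EOF
  echo $(( ${hts_h#0} * 3600 + ${hts_m#0} * 60 + ${hts_s#0} ))
}

last_success=$(hms_to_seconds 04:39:09)   # Last Success Date (time part)
now=$(hms_to_seconds 09:05:37)            # output of `date` (time part)
interval=1200                             # Backup Interval In Second

elapsed=$(( now - last_success ))
missed=$(( elapsed / interval ))
echo "elapsed=${elapsed}s missed_intervals=${missed}"
# prints: elapsed=15988s missed_intervals=13
```

So roughly 4 hours 26 minutes passed with no successful backup, which covers 13 full 20-minute intervals.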