zilliztech / milvus-helm

etcd snapshot cannot run successfully

damuji8 opened this issue · comments

Using milvus-helm 4.1.9, the etcd image is 3.5.5-r2. In this image, /opt/bitnami/scripts/etcd/snapshot.sh relies on /opt/bitnami/scripts/libetcd.sh, which defines:

etcdctl_get_endpoints() {
    # strip every "name=" prefix from ETCD_INITIAL_CLUSTER to get an endpoint list
    # (the + quantifier requires extended regex, hence sed -E)
    echo "$ETCD_INITIAL_CLUSTER" | sed -E 's/^[^=]+=http/http/g' | sed -E 's/,[^=]+=/,/g'
}
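As a rough illustration, this pipeline just strips the `name=` prefix from each member entry of ETCD_INITIAL_CLUSTER, turning it into a plain comma-separated endpoint list. The hostnames below are invented for the example:

```shell
# Hypothetical ETCD_INITIAL_CLUSTER value, as a StatefulSet-based chart would set it
ETCD_INITIAL_CLUSTER="etcd-0=http://etcd-0.etcd-headless:2380,etcd-1=http://etcd-1.etcd-headless:2380"

# Strip the leading "name=" of the first member, then each ",name=" of the rest,
# leaving only the endpoint URLs
echo "$ETCD_INITIAL_CLUSTER" | sed -E 's/^[^=]+=http/http/g' | sed -E 's/,[^=]+=/,/g'
```

Of course, this only works if the cronjob's pod actually has ETCD_INITIAL_CLUSTER in its environment, which is the crux of this issue.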

I need to add the env var ETCD_INITIAL_CLUSTER to the cronjob.

Without this env var, the snapshot fails with the error "all etcd endpoints are unhealthy!".

In the etcd image etcd:3.5.5-debian-11-r23, /opt/bitnami/scripts/libetcd.sh defines the same function as:
etcdctl_get_endpoints() {
    local only_others=${1:-false}
    local -a endpoints=()
    local host domain port

    ip_has_valid_hostname() {
        local ip="${1:?ip is required}"
        local parent_domain="${2:?parent_domain is required}"

        # 'getent hosts $ip' can return hostnames in 2 different formats:
        #     POD_NAME.HEADLESS_SVC_DOMAIN.NAMESPACE.svc.cluster.local (using headless service domain)
        #     10-237-136-79.SVC_DOMAIN.NAMESPACE.svc.cluster.local (using POD's IP and service domain)
        # We need to discard the latter to avoid issues when TLS verification is enabled.
        [[ "$(getent hosts "$ip")" = *"$parent_domain"* ]] && return 0
        return 1
    }

    hostname_has_ips() {
        local hostname="${1:?hostname is required}"
        [[ "$(getent ahosts "$hostname")" != "" ]] && return 0
        return 1
    }

    # This piece of code assumes it is executed in a K8s environment
    # where etcd members are part of a statefulset that uses a headless service
    # to create a unique FQDN per member. Under these circumstances, the
    # ETCD_ADVERTISE_CLIENT_URLS env. variable is created as follows:
    #   SCHEME://POD_NAME.HEADLESS_SVC_DOMAIN:CLIENT_PORT,SCHEME://SVC_DOMAIN:SVC_CLIENT_PORT
    #
    # Assuming this, we can extract the HEADLESS_SVC_DOMAIN and obtain
    # every available endpoint
    read -r -a advertised_array <<<"$(tr ',;' ' ' <<<"$ETCD_ADVERTISE_CLIENT_URLS")"
    host="$(parse_uri "${advertised_array[0]}" "host")"
    port="$(parse_uri "${advertised_array[0]}" "port")"
    domain="${host#"${ETCD_NAME}."}"
    # When ETCD_CLUSTER_DOMAIN is set, we use that value instead of extracting
    # it from ETCD_ADVERTISE_CLIENT_URLS
    ! is_empty_value "$ETCD_CLUSTER_DOMAIN" && domain="$ETCD_CLUSTER_DOMAIN"
    # Depending on the K8s distro & the DNS plugin, it might need
    # a few seconds to associate the POD(s) IP(s) to the headless svc domain
    if retry_while "hostname_has_ips $domain"; then
        local -r ahosts="$(getent ahosts "$domain" | awk '{print $1}' | uniq | wc -l)"
        for i in $(seq 0 $((ahosts - 1))); do
            # Use the StatefulSet name stored in MY_STS_NAME to derive the peer names
            # from the number of IPs registered in the headless service
            pod_name="${MY_STS_NAME}-${i}"
            if ! { [[ $only_others = true ]] && [[ "$pod_name" = "$MY_POD_NAME" ]]; }; then
                endpoints+=("${pod_name}.${ETCD_CLUSTER_DOMAIN}:${port:-2380}")
            fi
        done
    fi
    echo "${endpoints[*]}" | tr ' ' ','
}
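To make the domain-extraction step above concrete, here is a minimal stand-alone sketch. parse_uri is a bitnami helper, so plain parameter expansion stands in for it, and all hostnames are invented:

```shell
# Hypothetical advertised URLs: per-pod URL first, then the service-wide URL
ETCD_ADVERTISE_CLIENT_URLS="http://etcd-0.etcd-headless.default.svc.cluster.local:2379,http://etcd.default.svc.cluster.local:2379"
ETCD_NAME="etcd-0"

# Take the first (per-pod) URL from the comma-separated list
first="${ETCD_ADVERTISE_CLIENT_URLS%%,*}"

# Stand-in for bitnami's parse_uri: strip the scheme, then split host and port
host="${first#*://}"
port="${host##*:}"
host="${host%%:*}"

# Dropping the "POD_NAME." prefix yields the headless service domain
domain="${host#"${ETCD_NAME}."}"
echo "$domain:$port"
```

This is the domain the function then resolves with getent to count the cluster's pods.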

The bitnami helm template sets the env vars ETCD_CLUSTER_DOMAIN and MY_STS_NAME, so the snapshot can run successfully.
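For reference, when both variables are present, the loop in that function effectively assembles the endpoint list as sketched below. The release name is invented and the IP count is hard-coded, since getent needs a live cluster:

```shell
# Invented values; in the real script these come from the pod environment
MY_STS_NAME="my-release-etcd"
ETCD_CLUSTER_DOMAIN="my-release-etcd-headless.default.svc.cluster.local"
port=2379
ahosts=3   # the real code counts IPs via: getent ahosts "$domain" | awk '{print $1}' | uniq | wc -l

endpoints=""
i=0
while [ "$i" -lt "$ahosts" ]; do
    pod_name="${MY_STS_NAME}-${i}"
    # append, comma-separating after the first entry
    endpoints="${endpoints:+${endpoints},}${pod_name}.${ETCD_CLUSTER_DOMAIN}:${port}"
    i=$((i + 1))
done
echo "$endpoints"
```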

I think this is the problem.

Hi @damuji8, thank you for this feedback. We forked the bitnami etcd docker image (https://github.com/milvus-io/bitnami-docker-etcd) to fix its occasional initialization failures and to solve a scale-out problem.

We didn't use or test any functions other than running and scaling, so the others are very likely broken. We forked the repo at the tag 3.4.18-debian-10-r50, so features added after that tag are not supported either.

And for this particular case, I believe bitnami's way of handling this is far too complicated, so I removed all the logic in etcdctl_get_endpoints() and used ETCD_INITIAL_CLUSTER directly. That is why you see only one line of code in etcdctl_get_endpoints().

I checked the template and my test release: I'm sure ETCD_CLUSTER_DOMAIN is set, but MY_STS_NAME is not. That's because the etcd chart version we're using is 6.3.3, which is quite old. But it's very stable, and we don't intend to change it.

You may add it by setting env vars in the helm release values.
A PR to add it to the default values would also be very welcome, if you've got time.
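For example, something along these lines in the release values might work. The exact keys depend on the chart version, so treat the names below as illustrative only:

```yaml
# Illustrative values override: the key names and the release name are
# assumptions, not taken from a specific milvus-helm / bitnami-etcd version
etcd:
  extraEnvVars:
    - name: MY_STS_NAME
      value: "my-release-etcd"   # the etcd StatefulSet's name in your cluster
```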

Does your etcd have the env var ETCD_INITIAL_CLUSTER? I set this env var in the etcd helm template myself.

@damuji8 Yes, ETCD_INITIAL_CLUSTER is included in the statefulset template.

But the cronjob YAML does not include the ETCD_INITIAL_CLUSTER env var.

Without this env var, the snapshot cannot be taken successfully.