coredns / deployment

Scripts, utilities, and examples for deploying CoreDNS.

kubernetes: default corefile structure

chrisohaver opened this issue

Currently, the default corefile deployed by our deployment script uses a single server block, e.g.:

    .:53 {
        errors
        health
        ready
        kubernetes <CLUSTER_DOMAIN> <REVERSE_ZONES...> {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }

Let's discuss concrete pros and cons of alternative defaults that are structured differently (e.g. multiple server blocks).

As separate server blocks, and without the reverse zone fallthrough, this looks something like...

    <CLUSTER_DOMAIN> <REVERSE_ZONES...> {
        health
        ready
        errors
        kubernetes . {
          pods insecure
        }
        loadbalance
    }

    . {
        prometheus :9153
        forward . /etc/resolv.conf
        loop
        cache 30
        reload
    }

Although it has been urged that we do so, it's not clear to me what the concrete benefit of splitting into multiple server blocks is. There is some efficiency gained, I think, since the split-up plugin chains are shorter than the single chain: each query traverses fewer plugins (though I think that in itself would be a fairly minor gain). I'm fuzzy about how this relates to plugins like prometheus... would we log metrics for queries routed to both server blocks in the above example?

I think this would leave kubernetes records uncached. But I think that is good, since kubernetes already caches data in its client-go connection to the k8s API.

Splitting into multiple blocks means we cannot use fallthrough in kubernetes to forward queries to the upstream name servers. This requires us to define the reverse zones as only those for which we are authoritative.
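
For illustration only (these CIDRs are hypothetical; the deployment script has no way to know them), being authoritative for just the cluster's own reverse zones would mean enumerating them in the server block key, something like:

    <CLUSTER_DOMAIN> 244.10.in-addr.arpa 96.10.in-addr.arpa {
        health
        ready
        errors
        kubernetes . {
          pods insecure
        }
        loadbalance
    }

... where 244.10.in-addr.arpa and 96.10.in-addr.arpa stand in for the reverse zones of a made-up pod CIDR of 10.244.0.0/16 and a made-up service CIDR of 10.96.0.0/16.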

One detail that makes this impractical to deploy is that we don't have a programmatic way to get all the reverse zones of the kubernetes cluster. In some clusters, pod CIDR ranges are defined per node, so the list of zones can change as nodes are added or removed. We could instead ...

  • make the server block authoritative for all reverse zones as we did before, but without fallthrough, not allowing any reverse lookups upstream (see the sketch after this list). This technically would not violate the DNS spec, although I don't think it's a very usable solution. (I think this is how kube-dns did it, though I'm not 100% certain.)
  • make the server block not authoritative for any reverse zones, but this would violate the DNS spec.
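
A rough sketch of the first option, based on the split layout above: the cluster block claims in-addr.arpa and ip6.arpa outright, kubernetes has no fallthrough, and the catch-all . block stays as before, so reverse lookups for non-cluster IPs are answered locally (with NXDOMAIN) instead of being forwarded.

    <CLUSTER_DOMAIN> in-addr.arpa ip6.arpa {
        health
        ready
        errors
        kubernetes . {
          pods insecure
        }
        loadbalance
    }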

@fturib, you also had a concern about multiple zones for a single server block. Can you elaborate on that here?

Yes, from Caddy's behavior, the setup function of each plugin will be called as many times as there are keys defined in the ServerBlock. It is defined in executeDirectives here.

In other words, with such a configuration we would have as many instances of the Kubernetes structure as there are zones defined in this server block:

    <CLUSTER_DOMAIN> <REVERSE_ZONES...> {
        ...
        kubernetes . {
          pods insecure
        }
        ...
    }

And each of those kubernetes structures will contain the object caches of the k8s Client API.

I guess we cannot go this way because of the scalability issue, which depends on the size of the cluster.

We may also want to fix that duplication (useless, I think, though I'm not completely sure) by revising the function InspectServerBlocks here.

@fturib, how do other plugins cope with that, e.g. health, metrics? Multiple instances of those would collide on the listening port.

It is not managed by the other plugins.
For metrics, the problem is larger than just the keys of the server block: the same listening port can be shared between several plugin instances in different server blocks. So there is already a mechanism to share the same listener across several instances of the same plugin.
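
If I understand that correctly, then a sketch like the one below (the same prometheus address declared in both server blocks) should export metrics for queries hitting either block from a single :9153 listener, though that is worth verifying:

    <CLUSTER_DOMAIN> <REVERSE_ZONES...> {
        prometheus :9153
        kubernetes . {
          pods insecure
        }
    }

    . {
        prometheus :9153
        forward . /etc/resolv.conf
    }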

For health, I am not sure how that works with multiple keys... does it?
It looks like it does not:

    francois.com:5553 francois.fr:5553 {
        health
    }

log of CoreDNS:

    listen tcp :8080: bind: address already in use
    Process finished with exit code 1

So it seems that, at least until the plugin instance duplication for multi-zone blocks is fixed, we will have to stick to a single stanza for now.

The question remains whether or not we should cease using fallthrough for reverse domains. It becomes a matter of choosing the lesser evil...

  1. Continue to use fallthrough, and suffer a performance impact for reverse zone queries of IPs that are not in the kubernetes cluster.
  2. Stop using fallthrough, but still be authoritative for all reverse zones, thus disabling reverse zone queries of IPs that are not in the kubernetes cluster.
  3. Stop using fallthrough, and do not be authoritative for any reverse zones, violating the k8s DNS spec.

I think the lesser evil is 1. I don't think the performance impact is great enough to warrant going with 2 or 3.

Per @johnbelamaric's experiments, it looks like client-go employs some form of connection sharing, so the memory concern may not be an issue. However, it depends on at which level the connections are shared (e.g. is there a single network connection but multiple cache stores, or a single cache store, etc.). Further testing/experimentation is required here.

I think just the connection is shared. Meaning the caches are duplicated.

I have more or less confirmed this when testing coredns/coredns#3862. With coredns/coredns#3862, the caches are not duplicated for each zone in the server block, and therefore the memory footprint drops (considerably).

Nice, I will try to review that soon.