coredns / coredns

CoreDNS is a DNS server that chains plugins

Home Page: https://coredns.io

CoreDNS (local DNS) crashes under high-concurrency stress traffic

zengyuxing007 opened this issue

What happened:

CoreDNS crashes under high-concurrency stress traffic.

panic: too many concurrent operations on a single file or socket (max 1048575)

The loop plugin is configured, and there are no looping DNS queries.

It just seems to be a lot of DNS traffic; there may be many out-of-cluster domain name queries. (Ref issue: istio/istio#50119)

More logs:

[ERROR] Recovered from panic in server: "dns://:53" too many concurrent operations on a single file or socket (max 1048575)
[ERROR] Recovered from panic in server: "dns://:53" too many concurrent operations on a single file or socket (max 1048575)
[ERROR] Recovered from panic in server: "dns://:53" too many concurrent operations on a single file or socket (max 1048575)
[ERROR] Recovered from panic in server: "dns://:53" too many concurrent operations on a single file or socket (max 1048575)
panic: too many concurrent operations on a single file or socket (max 1048575) [recovered]
	panic: too many concurrent operations on a single file or socket (max 1048575)

goroutine 3314264 [running]:
internal/poll.(*fdMutex).rwlock(0xc0007bc900, 0xc0?)
	/usr/local/go/src/internal/poll/fd_mutex.go:147 +0x11b
internal/poll.(*FD).writeLock(...)
	/usr/local/go/src/internal/poll/fd_mutex.go:239
internal/poll.(*FD).WriteMsg(0xc0007bc900, {0xc2e5aa2ba0, 0x22, 0x23}, {0xc2e5adc0c0, 0x20, 0x20}, {0x2365f60, 0xc2e549f940})
	/usr/local/go/src/internal/poll/fd_unix.go:518 +0xa5
net.(*netFD).writeMsg(0xc0007bc900, {0xc2e5aa2ba0?, 0x20?, 0x24?}, {0xc2e5adc0c0?, 0x20?, 0xc2e5adc0d0?}, {0x2365f60?, 0xc2e549f940?})
	/usr/local/go/src/net/fd_posix.go:120 +0x45
net.(*UDPConn).writeMsg(0xc00000e508, {0xc2e5aa2ba0, 0x22, 0x23}, {0xc2e5adc0c0, 0x20, 0x20}, 0x6fe1b5?)
	/usr/local/go/src/net/udpsock_posix.go:176 +0xd6
net.(*UDPConn).WriteMsgUDP(0xc00000e508, {0xc2e5aa2ba0?, 0x0?, 0xc2e5af8601?}, {0xc2e5adc0c0?, 0x0?, 0x0?}, 0xc2e6614900)
	/usr/local/go/src/net/udpsock.go:258 +0x47
github.com/miekg/dns.WriteToSessionUDP(0x6c6c33?, {0xc2e5aa2ba0, 0x22, 0x23}, 0xc2e4e13d20)
	/build/vendor/github.com/miekg/dns/udp.go:52 +0x71
github.com/miekg/dns.(*response).Write(0xc2e5af86e0?, {0xc2e5aa2ba0?, 0x72ae8d?, 0xc2e5af8718?})
	/build/vendor/github.com/miekg/dns/server.go:756 +0x4b
github.com/miekg/dns.(*response).WriteMsg(0xc2e54e1880, 0xc2e5ad10e0?)
	/build/vendor/github.com/miekg/dns/server.go:743 +0x8e
github.com/coredns/coredns/request.(*ScrubWriter).WriteMsg(0xc2e5ae0150, 0x2382ed8?)
	/build/request/writer.go:20 +0x98
github.com/coredns/coredns/core/dnsserver.errorAndMetricsFunc({0xc000798815, 0x9}, {0x2382ed8, 0xc2e5ae0150}, 0xc2e5ad0f30, 0x2)
	/build/core/dnsserver/server.go:340 +0x268
github.com/coredns/coredns/core/dnsserver.(*Server).ServeDNS.func1()
	/build/core/dnsserver/server.go:225 +0x1ef
panic({0x1b72d60, 0x235f8e0})
	/usr/local/go/src/runtime/panic.go:838 +0x207
internal/poll.(*fdMutex).rwlock(0xc0007bc900, 0xc0?)
	/usr/local/go/src/internal/poll/fd_mutex.go:147 +0x11b
internal/poll.(*FD).writeLock(...)
	/usr/local/go/src/internal/poll/fd_mutex.go:239
internal/poll.(*FD).WriteMsg(0xc0007bc900, {0xc2e5ad20f0, 0x42, 0x43}, {0xc2e5adc0a0, 0x20, 0x20}, {0x2365f60, 0xc2e549f900})
	/usr/local/go/src/internal/poll/fd_unix.go:518 +0xa5
net.(*netFD).writeMsg(0xc0007bc900, {0xc2e5ad20f0?, 0x20?, 0x24?}, {0xc2e5adc0a0?, 0x20?, 0xc2e5adc0b0?}, {0x2365f60?, 0xc2e549f900?})
	/usr/local/go/src/net/fd_posix.go:120 +0x45
net.(*UDPConn).writeMsg(0xc00000e508, {0xc2e5ad20f0, 0x42, 0x43}, {0xc2e5adc0a0, 0x20, 0x20}, 0x6fe1b5?)
	/usr/local/go/src/net/udpsock_posix.go:176 +0xd6
net.(*UDPConn).WriteMsgUDP(0xc00000e508, {0xc2e5ad20f0?, 0x0?, 0x1?}, {0xc2e5adc0a0?, 0x0?, 0x0?}, 0xc2e6614900)
	/usr/local/go/src/net/udpsock.go:258 +0x47
github.com/miekg/dns.WriteToSessionUDP(0x6c6c33?, {0xc2e5ad20f0, 0x42, 0x43}, 0xc2e4e13d20)
	/build/vendor/github.com/miekg/dns/udp.go:52 +0x71
github.com/miekg/dns.(*response).Write(0x22?, {0xc2e5ad20f0?, 0x72ae8d?, 0xc2e5af0e68?})
	/build/vendor/github.com/miekg/dns/server.go:756 +0x4b
github.com/miekg/dns.(*response).WriteMsg(0xc2e54e1880, 0xc2e5ad1050?)
	/build/vendor/github.com/miekg/dns/server.go:743 +0x8e
github.com/coredns/coredns/request.(*ScrubWriter).WriteMsg(0xc2e5ae0150, 0x32263e0?)
	/build/request/writer.go:20 +0x98
github.com/coredns/coredns/plugin/pkg/dnstest.(*Recorder).WriteMsg(0xc2e5ac53c0, 0xc2e5ad1050)
	/build/plugin/pkg/dnstest/recorder.go:44 +0x68
github.com/coredns/coredns/plugin/metrics.(*Recorder).WriteMsg(0xc2e5ac5400, 0xc2e5acc310?)
	/build/plugin/metrics/recorder.go:27 +0xb3
github.com/coredns/coredns/plugin/pkg/dnstest.(*Recorder).WriteMsg(0xc2e5ac5440, 0xc2e5ad1050)
	/build/plugin/pkg/dnstest/recorder.go:44 +0x68
github.com/coredns/coredns/plugin/loadbalance.(*RoundRobinResponseWriter).WriteMsg(0xc2e5acc2f0, 0xc2e5ad1050)
	/build/plugin/loadbalance/loadbalance.go:25 +0x13a
github.com/coredns/coredns/plugin/cache.(*Cache).ServeDNS(0xc0004c8630, {0x237dea0?, 0xc2e5ac6ae0}, {0x2382cc8, 0xc2e5acc2f0}, 0x0?)
	/build/plugin/cache/handler.go:66 +0x99e
github.com/coredns/coredns/plugin.NextOrFailure({0x1f3ba10?, 0x4179db?}, {0x2369620, 0xc0004c8630}, {0x237dea0, 0xc2e5ac6ae0}, {0x2382cc8, 0xc2e5acc2f0}, 0xc1fea3e3f8?)
	/build/plugin/plugin.go:80 +0x26a
github.com/coredns/coredns/plugin/loadbalance.RoundRobin.ServeDNS({{0x2369620?, 0xc0004c8630?}}, {0x237dea0, 0xc2e5ac6ae0}, {0x2382dd0?, 0xc2e5ac5440}, 0x9?)
	/build/plugin/loadbalance/handler.go:20 +0xcf
github.com/coredns/coredns/plugin.NextOrFailure({0x1f30f87?, 0x1?}, {0x236f2a0, 0xc000794600}, {0x237dea0, 0xc2e5ac6ae0}, {0x2382dd0, 0xc2e5ac5440}, 0x32aa701?)
	/build/plugin/plugin.go:80 +0x26a
github.com/coredns/coredns/plugin/log.Logger.ServeDNS({{0x236f2a0, 0xc000794600}, {0xc0003d9020, 0x1, 0x1}, {}}, {0x237dea0, 0xc2e5ac6ae0}, {0x2382d20?, 0xc2e5ac5400?}, ...)
	/build/plugin/log/log.go:36 +0x346
github.com/coredns/coredns/plugin.NextOrFailure({0x1f33b05?, 0xc1fea3eda8?}, {0x236f2f0, 0xc0007abaa0}, {0x237dea0, 0xc2e5ac6ae0}, {0x2382d20, 0xc2e5ac5400}, 0x0?)
	/build/plugin/plugin.go:80 +0x26a
github.com/coredns/coredns/plugin/errors.(*errorHandler).ServeDNS(0xc0003d8fc0, {0x237dea0?, 0xc2e5ac6ae0?}, {0x2382d20, 0xc2e5ac5400}, 0xc2e5ad0f30)
	/build/plugin/errors/errors.go:84 +0x87
github.com/coredns/coredns/plugin.NextOrFailure({0x1f3a39b?, 0x1?}, {0x23696c0, 0xc0003d8fc0}, {0x237dea0, 0xc2e5ac6ae0}, {0x2382d20, 0xc2e5ac5400}, 0x0?)
	/build/plugin/plugin.go:80 +0x26a
github.com/coredns/coredns/plugin/metrics.(*Metrics).ServeDNS(0xc0003b65a0, {0x237dea0, 0xc2e5ac6ae0}, {0x2382ed8?, 0xc2e5ae0150}, 0xc2e5ad0f30)
	/build/plugin/metrics/handler.go:27 +0x255
github.com/coredns/coredns/core/dnsserver.(*Server).ServeDNS(0xc000140060, {0x237dea0, 0xc2e5ac6ae0}, {0x2382ed8, 0xc2e5ae0150}, 0xc2e5ad0f30)
	/build/core/dnsserver/server.go:287 +0x64d
github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket.func1({0x23849a8, 0xc2e54e1880}, 0xc2e5ae0138?)
	/build/core/dnsserver/server.go:131 +0x9b
github.com/miekg/dns.HandlerFunc.ServeDNS(0xc192fa0a00?, {0x23849a8?, 0xc2e54e1880?}, 0xc2e5ad0f30?)
	/build/vendor/github.com/miekg/dns/server.go:37 +0x2f
github.com/miekg/dns.(*Server).serveDNS(0xc000981b00, {0xc192fa0a00, 0x22, 0x200}, 0xc2e54e1880)
	/build/vendor/github.com/miekg/dns/server.go:659 +0x451
github.com/miekg/dns.(*Server).serveUDPPacket(0xc000981b00, 0xc01c6d72e0?, {0xc192fa0a00, 0x22, 0x200}, {0x2382538?, 0xc00000e508}, 0xc2e4e13d20, {0x0, 0x0})
	/build/vendor/github.com/miekg/dns/server.go:603 +0x1dc
created by github.com/miekg/dns.(*Server).serveUDP
	/build/vendor/github.com/miekg/dns/server.go:533 +0x485
panic: too many concurrent operations on a single file or socket (max 1048575) [recovered]
	panic: too many concurrent operations on a single file or socket (max 1048575)

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • the version of CoreDNS:
    v1.7
  • Corefile:
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        cache {
          success 9984 30
          denial 9984 5
        }

        reload
        loop
        bind 169.254.20.10
        forward . 172.16.0.10 {
          force_tcp
        }
        prometheus :9253
        health 169.254.20.10:8081
    }
kind: ConfigMap
  • logs, if applicable:
  • OS (e.g: cat /etc/os-release):
  • Others:

Node Local DNS (using CoreDNS core)

1048575: it looks like 200,000 concurrency plus the 5s timeout to the upstream DNS server may trigger this?
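(A hedged back-of-envelope reading of that: if roughly 200,000 queries per second arrive and each can be held for up to the 5 s upstream timeout, on the order of 200,000 × 5 = 1,000,000 requests can be in flight at once, just under the 1,048,575 cap named in the panic message.)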

https://github.com/coredns/coredns/blob/master/plugin/forward/setup.go#L279

Should we set a default value for the max_concurrent parameter even when it is not configured, so that we can avoid triggering this panic in Go?

> the version of CoreDNS: v1.7

Can you test on a more up-to-date version?
1.7 is very old.

> https://github.com/coredns/coredns/blob/master/plugin/forward/setup.go#L279
>
> Should we set a default value for the max_concurrent parameter even when it is not configured, so that we can avoid triggering this panic in Go?

I wouldn’t think so. It appears the issue is related to accepting concurrent incoming connections to the server, not when forward makes connections to upstream servers.

> I wouldn’t think so. It appears the issue is related to accepting concurrent incoming connections to the server, not when forward makes connections to upstream servers.

More precisely, when writing responses to those incoming requests. IIUC, the limit can be reached regardless of whether the forward plugin is in use, so while setting max concurrent in forward might help indirectly, it’s not a good overall solution.
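To make that mechanism concrete (this sketch is not from the issue, and its names and counts are hypothetical): a DNS server writes every UDP response through its single listening socket, and Go's internal/poll fdMutex allows at most 1<<20 - 1 = 1,048,575 operations to be pending on one file descriptor before panicking with exactly the message quoted above. A minimal Go sketch of that shape:

package main

// Hypothetical sketch (not CoreDNS code): every goroutine writes through the
// same *net.UDPConn, i.e. the same file descriptor, just as a DNS server
// writes all of its UDP responses through one listening socket.

import (
	"net"
	"sync"
)

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	dst := &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 9} // illustrative destination
	payload := make([]byte, 64)                              // stand-in for a DNS response

	// Go's internal/poll.fdMutex caps operations pending on one descriptor at
	// 1<<20 - 1 (1048575); exceeding that panics with "too many concurrent
	// operations on a single file or socket". 10,000 writers stay far below
	// the cap, so this sketch runs cleanly; the crash in this issue required
	// roughly a million in-flight responses on the one socket.
	const writers = 10000
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = conn.WriteToUDP(payload, dst) // all writers contend on one fdMutex
		}()
	}
	wg.Wait()
}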

@chrisohaver Thanks for your reply~

About the CoreDNS version
In the real environment, both node-local DNS and CoreDNS are deployed, and all of them crash occasionally.

The Go stack trace seen when CoreDNS restarts is the same as above, with the same root cause: panic: too many concurrent operations on a single file or socket (max 1048575)

I did some troubleshooting; the reason is that Istio has a lot of ServiceEntry resources configured, and there are a lot of sidecars in the cluster, which regularly and concurrently issue DNS queries for external domains every 30s.

The DNS Query Path:

Sidecars (many) -> Node Local DNS (its go.mod uses CoreDNS v1.7) -> CoreDNS (1.9) -> Upstream DNS Server

It appears that Node Local DNS / CoreDNS has accumulated many DNS requests, all of which are waiting for responses from the upstream.

You can set max concurrent in forward in your configuration to mitigate your issue. But I don’t agree that it’s good to make it the default.
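For concreteness, here is a hedged sketch of what that could look like in the reporter's node-local Corefile; the 50000 value is purely illustrative and would need to be sized to the workload and available memory (queries above the limit are answered with an error response instead of being forwarded):

.:53 {
    errors
    cache {
      success 9984 30
      denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10
    forward . 172.16.0.10 {
      force_tcp
      max_concurrent 50000   # illustrative cap, not a recommendation
    }
    prometheus :9253
    health 169.254.20.10:8081
}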

> You can set max concurrent in forward in your configuration to mitigate your issue. But I don’t agree that it’s good to make it the default.

My idea is that we could set the max_concurrent default value to 1048575 - 1. That may avoid triggering the underlying limit in the Go runtime, so the CoreDNS process would not panic and the pod would not restart.

It would have no effect on DNS responses that do not flow through the forward plugin.

I think this would need to be fixed in miekg/dns if it hasn’t already been fixed in the many years since coredns 1.7 was released.

@zengyuxing007, can you replicate this error using the latest coredns release?

> @zengyuxing007, can you replicate this error using the latest coredns release?

@chrisohaver I tried to reproduce it but couldn't; there might be some specific conditions required.

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days