Routing fails randomly, version 0.10.x
tairila opened this issue · comments
Summary
I noticed that Kong routing to an API sometimes fails at random. When trying to access an application through Kong, the browser shows the error message "An unexpected error occurred". This worked fine earlier with Kong version 0.9.7 and Cassandra 2.x.
[error] 126#0: *8877 [lua] responses.lua:101: before(): failed the initial dns/balancer resolve for 'xxx' with: dns query returned no results, client: xxx.xxx.xxx.xxx, server: kong, request: "GET /yyy HTTP/1.1", host: "xxx:8080"
The API creation command:
curl -X POST localhost:8001/apis/ -d 'name=xxx' -d 'upstream_url=http://xxx:8080' -d 'preserve_host=true' -d 'uris=/yyy' -d 'strip_uri=true'
Steps To Reproduce
Repeat GET request several times for an API.
Additional Details & Logs
Kong version 0.10.0 & 0.10.1
Cassandra 3.0.10
The message explains exactly what happens. Kong queries the dns server to resolve the hostname but does not receive a proper answer from that server.
As you can see here, it will take the timeout and attempts settings from the resolv.conf configuration file.
If they are not set, it will be 5 attempts and a timeout of 2 seconds.
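To make that fallback concrete, here is a minimal Python sketch of reading the two settings from resolv.conf text. It is not Kong's Lua implementation: the function name read_resolver_options is hypothetical, and the defaults are simply the values quoted above (5 attempts, 2 second timeout).

```python
# Defaults as quoted in this thread; used when resolv.conf sets no value.
DEFAULT_TIMEOUT_SECONDS = 2
DEFAULT_ATTEMPTS = 5


def read_resolver_options(resolv_conf_text):
    """Return the timeout and attempts settings found in resolv.conf text.

    Hypothetical helper for illustration only, not part of Kong.
    """
    options = {"timeout": DEFAULT_TIMEOUT_SECONDS, "attempts": DEFAULT_ATTEMPTS}
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (resolv.conf uses ';' or '#').
        if not line or line[0] in ";#":
            continue
        fields = line.split()
        if fields[0] != "options":
            continue
        # An options line looks like: options timeout:1 attempts:3
        for field in fields[1:]:
            name, _, value = field.partition(":")
            if name in options and value.isdigit():
                options[name] = int(value)
    return options
```

With a file that has no options line, the sketch returns the defaults unchanged, which matches the behaviour described above.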
The 'failed the initial dns/balancer resolve' message is generated here, whilst the 'dns query returned no results' message is generated in the dns lib here, when the nameserver returns a record, but an empty one.
When Kong resolves a name, it will try to resolve in the following order: 'last-successful-type', SRV, A, AAAA and finally CNAME.
What do the DNS records look like, in that order?
They look like this:
; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.4 <<>> mesos-ui.marathon.slave.mesos
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2190
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;mesos-ui.marathon.slave.mesos. IN A
;; ANSWER SECTION:
mesos-ui.marathon.slave.mesos. 60 IN A 10.254.4.45
;; Query time: 0 msec
;; SERVER: 10.254.20.255#53(10.254.20.255)
;; WHEN: Thu Mar 30 10:26:27 EEST 2017
;; MSG SIZE rcvd: 63
I noticed one thing about the resolv.conf file, though: the error occurs when it contains the following nameservers:
nameserver 10.254.20.255
nameserver 10.254.20.175
nameserver 10.254.10.93
; generated by /usr/sbin/dhclient-script
search emea.xxx.net china.xxx.net apac.xxx.net americas.xxx.net
nameserver 10.131.39.252
nameserver 87.254.221.110
In this case only the first 3 nameservers are relevant. When I tested routing with only those in the resolv.conf file (everything else removed), it worked fine (no errors)!
Interesting. I'd expect the resolver to pick the next nameserver on a retry, but maybe it doesn't, and then fails because it keeps trying the same bad nameserver.
What response do you get if you explicitly query those removed servers?
Actually, I don't think the resolv.conf parser will honour the MAXNS setting of 3. See https://linux.die.net/man/5/resolv.conf
That's probably why the bad nameserver was queried where it shouldn't have been.
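As a rough illustration of what honouring that limit means, the Python sketch below keeps only the first MAXNS nameserver entries from resolv.conf text. This is an assumption-laden sketch, not the actual change made in the dns client: the function name parse_nameservers is hypothetical, and the sample input is the resolv.conf posted earlier in this thread.

```python
MAXNS = 3  # nameserver limit described in the resolv.conf man page


def parse_nameservers(resolv_conf_text, max_nameservers=MAXNS):
    """Return at most max_nameservers addresses, in file order.

    Hypothetical helper for illustration only, not the library's code.
    """
    servers = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (resolv.conf uses ';' or '#').
        if not line or line[0] in ";#":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            servers.append(fields[1])
    # Entries beyond the limit are ignored rather than queried.
    return servers[:max_nameservers]


# The resolv.conf contents reported in this issue.
RESOLV_CONF = """\
nameserver 10.254.20.255
nameserver 10.254.20.175
nameserver 10.254.10.93
; generated by /usr/sbin/dhclient-script
search emea.xxx.net china.xxx.net apac.xxx.net americas.xxx.net
nameserver 10.131.39.252
nameserver 87.254.221.110
"""

print(parse_nameservers(RESOLV_CONF))
# prints ['10.254.20.255', '10.254.20.175', '10.254.10.93']
```

With the limit applied, the two extra nameservers at the bottom of the file are never considered, which is the same effect as the manual workaround of deleting them.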
Fixed it in Kong/lua-resty-dns-client#7
The Kong dependency needs to be updated after a new dns client version is released.