reactor / reactor-netty

TCP/HTTP/UDP/QUIC client/server with Reactor over Netty

Home Page: https://projectreactor.io


Channel acquisition delay for a fraction of requests

mukeshj13 opened this issue · comments

We have been facing an issue with Spring Cloud Gateway on Netty where we see an intermittent delay before channel acquisition (before the log "Channel acquired, now: "). The occurrences increase steadily with the load, although the load is not enough to cause a spike in the gateway's resource utilization. Most of the delayed requests see a delay of less than 5s, but some see up to 20s. We are not using any load balancer either, just AWS DNS routing.

The configured max idle time is 5s with a fixed pool of 1000 max channels. The number of concurrent requests being served is much lower than the max connections. Horizontally scaling the gateway pods reduces the issue but doesn't eliminate it. I am not able to replicate it in my local setup either, as it is intermittent and increases with the load.
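For reference, a pool with these characteristics would look roughly like the following in plain reactor-netty (a minimal sketch; the provider name and variables are illustrative, not the gateway's actual configuration code):

import java.time.Duration;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

// Illustrative only: a fixed pool of 1000 channels with a 5s max idle time,
// mirroring the settings described above.
ConnectionProvider provider = ConnectionProvider.builder("gateway-pool")
        .maxConnections(1000)                 // fixed pool, 1000 max channels
        .maxIdleTime(Duration.ofSeconds(5))   // idle channels are closed after 5s
        .build();

HttpClient client = HttpClient.create(provider);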

Expected Behavior

The delay shouldn't be there, as there are available channels.

Actual Behavior

Here are the logs corresponding to a newly created channel

November 15th 2023, 18:01:13.027 [reactor-http-epoll-3]	[d782957f] Initialized pipeline DefaultChannelPipeline{(reactor.left.httpCodec = io.netty.handler.codec.http.HttpClientCodec), (reactor.right.reactiveBridge = reactor.netty.channel.ChannelOperationsHandler)}
November 15th 2023, 18:01:13.027 [reactor-http-epoll-3]	[d782957f] Created a new pooled channel, now: 2 active connections, 0 inactive connections and 0 pending acquire requests.	
November 15th 2023, 18:01:18.871 [reactor-http-epoll-3]	[d782957f] Connecting to [resource-inventory-host/[ip]:port].	
November 15th 2023, 18:01:18.874 [reactor-http-epoll-3]	[d782957f, L:/[ip]:port - R:resource-inventory-host/[ip]:port] Registering pool release on close event for channel	
November 15th 2023, 18:01:18.874 [reactor-http-epoll-3]	[d782957f, L:/[ip]:port - R:resource-inventory-host/[ip]:port] onStateChange(PooledConnection{channel=[id: 0xd782957f, L:/[ip]:port - R:resource-inventory-host/[ip]:port]}, [connected])	
November 15th 2023, 18:01:18.874 [reactor-http-epoll-3]	[d782957f-1, L:/[ip]:port - R:resource-inventory-host/[ip]:port] Handler is being applied: {uri=http://resource-inventory-host:port/v1/resources, method=GET}	
November 15th 2023, 18:01:18.874 [reactor-http-epoll-3]	[d782957f, L:/[ip]:port - R:resource-inventory-host/[ip]:port] Channel connected, now: 9 active connections, 3 inactive connections and 0 pending acquire requests.

Your Environment

reactor-netty 1.1.6
Netty 4.1.91.Final
spring-cloud-starter-gateway 4.0.5 with Java 17 on Linux

So, in the logs provided, a delay of about 5 seconds is happening between the "Created a new pooled channel" log:

November 15th 2023, 18:01:13.027 [reactor-http-epoll-3]	[d782957f] Created a new pooled channel, now: 2 active connections, 0 inactive connections and 0 pending acquire requests.	

and this log:

November 15th 2023, 18:01:18.871 [reactor-http-epoll-3]	[d782957f] Connecting to [resource-inventory-host/[ip]:port].	

So, if I'm correct, what happens between the first log and the second log is DNS resolution.
I suggest the following:

  • First, can you update reactor-netty to the latest 1.1.13 and use the latest Netty version, 4.1.101.Final, and see if the issue is resolved?

  • Can you configure the reactor-netty HTTP client resolver logging? It might reveal long DNS queries. Do something like the snippet below, and when you observe long delays, check the resolver logs ("org.example.dns" in the example):

HttpClient client = HttpClient.create(provider)
                ...
                .resolver(spec -> spec.trace("org.example.dns", LogLevel.DEBUG))
  • Are you using a custom DNS query timeout in the reactor-netty HTTP client configuration? (See https://projectreactor.io/docs/netty/release/reference/index.html#dns-timeout.) If yes, what is its value? If you are using a large DNS query timeout (more than 20 seconds?), can you reduce it to, say, 3 seconds? You may then observe many "Failed to resolve" exceptions instead of the 5s-20s of unexpected delays:
HttpClient client = HttpClient.create(provider)
                ...
                .resolver(spec -> spec.queryTimeout(Duration.ofSeconds(3)));
  • Can you enable metrics for the client (see the sketch after this list) and, using the actuator, check this metric when you know that some delays have taken place (the below meter reports the time spent resolving addresses):
actuator/metrics/reactor.netty.http.client.address.resolver
  • Also, if I'm correct, when Netty resolves addresses, it schedules tasks on some Netty event loops. So if one of the event loops is blocked for whatever reason, or if it is doing a lot of processing, then the DNS resolution will be delayed. I suggest checking that there is no code in the gateway doing some kind of blocking operation, and also whether there are some full GCs. Using the actuator, you can for example check the actuator/metrics/jvm.gc.pause metric, or just use the jstat command to check whether full GCs are happening.
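For the metrics point above, here is a minimal sketch of enabling the built-in client metrics directly on the reactor-netty HttpClient (assuming Micrometer is on the classpath; the meters are then exposed through the actuator):

import java.util.function.Function;
import reactor.netty.http.client.HttpClient;

// Minimal sketch: enable reactor-netty's Micrometer-based client metrics.
// Function.identity() keeps the full URI as the "uri" tag; in production,
// consider a mapping function that templates URIs to limit tag cardinality.
HttpClient client = HttpClient.create(provider)
                .metrics(true, Function.identity());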

thanks.

Thanks for the thorough analysis @pderop
I am using the default value for the DNS query timeout, which is 5s.

There aren't any blocking filters in the gateway, and vertically scaling the gateway pods from 2 cores to 4 cores did not bring up all the event loops either (only 4-5 event loops are active at a time).

The metric reactor.netty.http.client.address.resolver showed significant spikes [attached screenshot], which is why I temporarily disabled DNS resolution by routing directly to the resolved cluster IP of the downstream Kubernetes service.
After disabling DNS routing, I could no longer see the reactor.netty.http.client.address.resolver metrics.
Still, the execution time of the requests showed no notable signs of improvement.
We also checked nslookup and dig for the hostnames from the gateway pods for 100 requests, which did not show any stuck requests either.

Screenshot 2023-11-22 at 12 20 38 AM

jvm.gc.pause metrics show the sum of G1 evacuation pauses at less than 2 seconds.
Screenshot 2023-11-22 at 12 37 46 AM

Another interesting metric I am observing is the max HTTP client connect time in multiples of 5s, and its sum is significantly high too.
MAX:
Screenshot 2023-11-22 at 12 47 47 AM

SUM:
Screenshot 2023-11-22 at 12 45 59 AM

In very few requests, I am also observing a delay in receiving the response, and the delay is almost always in multiples of 5s (~5s and ~10s).
Quite a brain teaser.

As per your suggestion, I'll enable the resolver logs, but upgrading versions might take a little time.
Is there anything else I could check? I am starting to think that this might not be related to DNS lookup.

So, you have disabled DNS, but you still observe response delays for a few requests, correct?

To address the lingering delays despite DNS being disabled, let's focus on the connection establishment time. The metric actuator/metrics/reactor.netty.http.client.connect.time also reveals significant spikes, indicating occasional delays even excluding DNS resolve time.

I suggest two tests to isolate the issue:

  • Increase Connection Idle Time: Currently, a 5-second idle time leads to frequent connection closures and re-establishments. Try increasing this value substantially. By extending the idle timer, check if the delays persist without DNS resolution enabled. Are you then still encountering delays of 10-20 seconds in a few requests?

  • Adjust Connection Pool Size: If delays persist after extending the idle time, consider reducing the size of the connection pool. Match it closely to the number of simultaneous incoming requests. For instance, if you anticipate around 100 simultaneous requests, configure the pool to match this number (see the sketch after this list). Evaluate whether this adjustment impacts the delays significantly. The goal here is to test whether reducing the pool, combined with a large idle timer, reduces reconnections and mitigates the unexpected large delays for the few responses.
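Here is a minimal sketch of the two adjustments above (the values are purely illustrative):

import java.time.Duration;
import reactor.netty.resources.ConnectionProvider;

// Illustrative values only: a much longer idle time and a pool sized close
// to the expected number of simultaneous requests.
ConnectionProvider provider = ConnectionProvider.builder("gateway-pool")
        .maxConnections(100)                   // roughly the expected concurrency
        .maxIdleTime(Duration.ofSeconds(60))   // keep idle channels around longer
        .build();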

If these adjustments resolve the large delays, it's likely that the issue lies in the connection establishment time. Delays of 10-20 seconds during connection setup should prompt an investigation into the underlying network infrastructure.
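If connection setup turns out to be the culprit, one way to make slow handshakes visible as explicit errors rather than silent delays is Netty's standard connect timeout option; a minimal sketch (the 3-second value is arbitrary):

import io.netty.channel.ChannelOption;
import reactor.netty.http.client.HttpClient;

// Minimal sketch: fail connection attempts that take longer than 3 seconds,
// so slow connection establishment surfaces as an error instead of a delay.
HttpClient client = HttpClient.create(provider)
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 3000);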

If no more delays are observed, then re-enable DNS resolution (do not route directly to the resolved cluster IP of the downstream Kubernetes service). Do you then still observe large DNS resolution delays?

I disabled the connection pool, and a lot of the request delays disappeared, but some requests are still seeing delays in multiples of 5s while forwarding requests, and sometimes 5s delays on both the inbound and outbound routes (meaning a total delay of 10s or more).

So, we have excluded delays for DNS resolve time, as well as reconnection delays.

two questions:

  1. When you say that you are still seeing delays while forwarding requests, you mean that a few requests are received by the gateway but are forwarded to the upstream server with some delay; am I understanding correctly?

  2. Can you show the values for the following metrics:

  • reactor.netty.http.client.data.received.time
  • reactor.netty.http.client.data.sent.time
  • reactor.netty.http.client.response.time
  1. Yes, the requests are received by the gateway and forwarded through the gateway proxy client with a delay. Initially, 10-20% of requests were seeing delays of more than 1s. When the connection pool was disabled, the random delays were mitigated, but the delays that stayed were very close to multiples of 5 seconds.

Received time

Screenshot 2023-11-24 at 2 26 23 PM

Sent time

Screenshot 2023-11-24 at 2 27 28 PM

Response time

Screenshot 2023-11-24 at 2 22 22 PM

So now, after analysing the event loop stack traces in a thread dump, I believe I have found the cause of the multiples-of-5s delays. In my request logging filter, I am triggering a reverse lookup for the hostname:

String host = request.getRemoteAddress().getHostName();

In each of the thread dumps I could see at least one event loop in the method that triggers the reverse lookup:

"reactor-http-epoll-1" - Thread t@58
   java.lang.Thread.State: RUNNABLE
	at java.base@17.0.9/java.net.Inet6AddressImpl.getHostByAddr(Native Method)
	at java.base@17.0.9/java.net.InetAddress$PlatformNameService.getHostByAddr(InetAddress.java:940)
	at java.base@17.0.9/java.net.InetAddress.getHostFromNameService(InetAddress.java:662)
	at java.base@17.0.9/java.net.InetAddress.getHostName(InetAddress.java:605)
	at java.base@17.0.9/java.net.InetAddress.getHostName(InetAddress.java:577)
	at java.base@17.0.9/java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
	at java.base@17.0.9/java.net.InetSocketAddress.getHostName(InetSocketAddress.java:366)

So I disabled the request/response logging, and it resolved the issue; after that I couldn't see any delays.

It is a little hard to digest that this lookup is causing issues in the reactor proxy HTTP client connections to the downstream. Also, the request log filter runs early in the filter chain and I don't see any delay in the execution of this filter, so it must be causing issues at the network level for the concurrent requests flowing through the gateway. Could there be any issue with IPv6 hostname resolution causing the application to face issues in connection establishment for the requests currently being forwarded? It's quite misleading behaviour to encounter.

It's great that disabling the reverse DNS lookup resolved the problem. Netty schedules many tasks on the event loops (for example, to handle readability/writability socket events, etc.), and blocking the loops with IO-bound/blocking tasks can lead to significant performance issues.

While the execution of the filter itself might not show apparent delays, filters are executed within event loops, so while the thread is blocked on the reverse lookup, Netty won't be able to schedule any tasks on the blocked loop until the DNS resolution is done.
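For illustration, the reverse lookup can be avoided in such a filter by using InetSocketAddress#getHostString(), which returns the hostname if one was supplied when the address was created, or the literal IP string otherwise, without querying DNS (a sketch; the surrounding filter code is hypothetical):

// Blocking: may trigger a reverse DNS lookup on the event loop thread.
String host = request.getRemoteAddress().getHostName();

// Non-blocking alternative: no reverse lookup is attempted.
String hostNoLookup = request.getRemoteAddress().getHostString();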

It's wise to consider running BlockHound to validate that no other code in the application inadvertently blocks the event loops. Regarding IPv6 hostname resolution, while it's a possibility, the core concern remains ensuring that no operation inadvertently blocks the event loops.
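A minimal sketch of wiring in BlockHound (assuming the io.projectreactor.tools:blockhound dependency is on the classpath; the class name is illustrative):

import reactor.blockhound.BlockHound;

public class GatewayApplication {
    public static void main(String[] args) {
        // Install BlockHound as early as possible, before any Reactor/Netty
        // threads are created; blocking calls on non-blocking threads will
        // then fail with a BlockingOperationError instead of stalling a loop.
        BlockHound.install();
        // ... start the Spring application as usual
    }
}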

thanks.

If you agree, I will close this issue. Thanks.