reactor / reactor-netty

TCP/HTTP/UDP/QUIC client/server with Reactor over Netty

Home Page: https://projectreactor.io

Ongoing connection reset by peer

AbhiramDwivedi opened this issue

Two of our microservices, running Spring Boot and deployed on AWS EKS, keep running into intermittent "connection reset by peer" errors.

We have already applied #1774 (comment), and in fact used even shorter timeouts and evictions, but it does not help.
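
For reference, a minimal sketch of the kind of pool-eviction tuning involved; the durations below are placeholders rather than our exact values, and the WebClient wiring assumes Spring WebFlux's ReactorClientHttpConnector:

import java.time.Duration;

import io.netty.channel.ChannelOption;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class PooledWebClientConfig {

    // Placeholder durations; the point is to close pooled connections on the
    // client side before any intermediary has a chance to drop them silently.
    static WebClient buildWebClient() {
        ConnectionProvider provider = ConnectionProvider.builder("tuned-pool")
                .maxIdleTime(Duration.ofSeconds(20))        // close idle connections early
                .maxLifeTime(Duration.ofSeconds(60))        // cap total connection lifetime
                .evictInBackground(Duration.ofSeconds(30))  // evict proactively, not only on acquire
                .build();

        HttpClient httpClient = HttpClient.create(provider)
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 5_000);

        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}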

The problem does not happen when invocations are made from React applications to the Spring Boot server, or from Spring Boot clients to non-Reactor-based microservices. It is possible that the problem is in the infrastructure, but AWS refuses to accept that. In essence, this is hard to replicate outside of "our" environment, or outside of the individual environments in which others have faced it.

Expected Behavior

The subscriber should validate a connection before it uses it. If this is not the default, it should at least be an option. reactor-netty is flooded with issues like this, going back years, so it only makes sense to provide a code-level option that works across scenarios.

Actual Behavior

Intermittent error:
Caused by: org.springframework.web.reactive.function.client.WebClientRequestException: recvAddress(..) failed: Connection reset by peer; nested exception is io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
*__checkpoint ⇢ Request to GET http://application-URL [DefaultWebClient]
Original Stack Trace:
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)

Independent of this, the server has the following logs, which may or may not be related:

  • Last HTTP packet was sent, terminating the channel
  • Channel inbound receiver cancelled (subscription disposed).

Steps to Reproduce

Unable to replicate outside of our environment. Even within our environment, this happens only when calls are made between Spring Boot applications running in two different EKS clusters. It does not happen when the applications are running in the same EKS cluster.

Possible Solution

  • Validate a connection before using it, or
  • Provide an option to disable the connection pool (see the sketch below)
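
For the second option, as far as I can tell the underlying client can already bypass pooling via ConnectionProvider.newConnection(), at the cost of a new connection per request; a minimal sketch of what such a pool-less client could look like today, with the WebClient wiring assuming Spring WebFlux's ReactorClientHttpConnector:

import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class UnpooledWebClientConfig {

    // Every request opens (and closes) its own connection, so a stale pooled
    // connection can never be reused; the trade-off is per-request setup cost.
    static WebClient buildUnpooledWebClient() {
        HttpClient httpClient = HttpClient.create(ConnectionProvider.newConnection());

        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}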

Your Environment

Spring Boot applications running in two different EKS clusters.

  • Reactor version(s) used: projectreactor:reactor-core:jar:3.4.33
  • Other relevant libraries versions (eg. netty, ...): reactor-netty-core:jar:1.0.38
  • JVM version (java -version): openjdk version "11.0.22" 2024-01-16 LTS, OpenJDK Runtime Environment (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS)
  • OS and version (eg. uname -a): rhel 8
  • Spring Boot : spring-boot-starter-webflux:jar:2.7.17

@AbhiramDwivedi Have you checked https://projectreactor.io/docs/netty/release/reference/index.html#faq.connection-closed, especially the part where a Network Component drops a connection silently.
If you have checked that, please provide the TCP dump.
You might be interested in checking this https://medium.com/tenable-techblog/lessons-from-aws-nlb-timeouts-5028a8f65dda (in case you use AWS NLB) and this https://youtu.be/O4oZS-SAq14?t=526

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

AbhiramDwivedi commented

Hi @violetagg: Those links were quite useful in understanding the TCP settings. We tried pretty aggressive settings and that did not help. This is now almost solved with TCP changes on the target cluster, with the following changes:

  • keep-alive reduced from 3000 to 300
  • keep-alive-requests increased from 100 to 1000
  • upstream-keepalive-timeout increased from 60 to 300

However, we still run into it sometimes, and there is no consistent way of reproducing or solving this.

A project worth hundreds of millions was delayed due to this, and is now live with known intermittent issues. All network and dev teams have exhausted their capacity. Sometimes it is OK to move on rather than stay stuck trying to solve it.

For a case like this, or other future cases, I would expect the project developers to provide an option to kill the pool and behave like RestTemplate. We are probably going to make that code change on our end anyway, and use two different ways of invoking endpoints.

This bug is not about "my" issue, but rather about a permanent solution.

@AbhiramDwivedi You changed the timeouts on the target, but did you add any configuration on your client, e.g. maxIdleTime, as suggested in our FAQ?
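
A minimal sketch of that kind of client-side setting; the 240 s value is only an illustrative assumption, chosen to stay below the 300 s upstream keepalive mentioned earlier, and in general it should stay below the idle timeout of every hop between client and server:

import java.time.Duration;

import reactor.netty.resources.ConnectionProvider;

public class ClientIdleTimeConfig {

    // Keep the client-side idle time shorter than the upstream keepalive so the
    // client never reuses a connection the ingress may already have closed.
    static ConnectionProvider alignedProvider() {
        return ConnectionProvider.builder("aligned-pool")
                .maxIdleTime(Duration.ofSeconds(240))       // assumption: comfortably below 300 s
                .evictInBackground(Duration.ofSeconds(60))  // periodic background eviction
                .build();
    }
}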

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open.