reactor / reactor-netty

TCP/HTTP/UDP/QUIC client/server with Reactor over Netty

Home Page: https://projectreactor.io

Ongoing connection reset by peer

AbhiramDwivedi opened this issue

Two of our microservices, running Spring Boot and deployed on AWS EKS, keep running into intermittent "connection reset by peer" errors.

We have already applied #1774 (comment), and in fact used even shorter timeouts and evictions, but it does not help.
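
For reference, a minimal sketch of the kind of pool-eviction tuning involved; the durations below are placeholders rather than our exact values, and the WebClient wiring assumes Spring WebFlux's ReactorClientHttpConnector:

import java.time.Duration;

import io.netty.channel.ChannelOption;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class PooledWebClientConfig {

    // Placeholder durations; the point is to close pooled connections on the
    // client side before any intermediary has a chance to drop them silently.
    static WebClient buildWebClient() {
        ConnectionProvider provider = ConnectionProvider.builder("tuned-pool")
                .maxIdleTime(Duration.ofSeconds(20))        // close idle connections early
                .maxLifeTime(Duration.ofSeconds(60))        // cap total connection lifetime
                .evictInBackground(Duration.ofSeconds(30))  // evict proactively, not only on acquire
                .build();

        HttpClient httpClient = HttpClient.create(provider)
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 5_000);

        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}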

The problem does not happen when invocations are made from React applications to the Spring Boot server, or from Spring Boot clients to non-Reactor-based microservices. It is possible that the problem is in the infrastructure, but AWS refuses to accept that. In essence, this is hard to replicate outside of "our" environment, or outside of the individual environments in which others have faced it.

Expected Behavior

The subscriber should validate a connection before it uses it. If this is not the default, it should at least be an option. reactor-netty is flooded with issues like this, going back years, so it only makes sense to provide a code-level option that works across scenarios.

Actual Behavior

Intermittent error:
Caused by: org.springframework.web.reactive.function.client.WebClientRequestException: recvAddress(..) failed: Connection reset by peer; nested exception is io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
*__checkpoint ⇢ Request to GET http://application-URL [DefaultWebClient]
Original Stack Trace:
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)

Independent of this, the server has the following logs, which may or may not be related:

  • Last HTTP packet was sent, terminating the channel
  • Channel inbound receiver cancelled (subscription disposed).

Steps to Reproduce

Unable to replicate outside of our environment. Even within our environment, this happens only when calls are made between Spring Boot applications running in two different EKS clusters. It does not happen when the applications are running in the same EKS cluster.

Possible Solution

  • Validate a connection before using it, or
  • Provide an option to disable the connection pool (see the sketch below)
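
For the second option, as far as I can tell the underlying client can already bypass pooling via ConnectionProvider.newConnection(), at the cost of a new connection per request; a minimal sketch of what such a pool-less client could look like today, with the WebClient wiring assuming Spring WebFlux's ReactorClientHttpConnector:

import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class UnpooledWebClientConfig {

    // Every request opens (and closes) its own connection, so a stale pooled
    // connection can never be reused; the trade-off is per-request setup cost.
    static WebClient buildUnpooledWebClient() {
        HttpClient httpClient = HttpClient.create(ConnectionProvider.newConnection());

        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}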

Your Environment

Spring Boot applications running in two different EKS clusters.

  • Reactor version(s) used: projectreactor:reactor-core:jar:3.4.33
  • Other relevant libraries versions (eg. netty, ...): reactor-netty-core:jar:1.0.38
  • JVM version (java -version): openjdk version "11.0.22" 2024-01-16 LTS, OpenJDK Runtime Environment (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS)
  • OS and version (eg. uname -a): rhel 8
  • Spring Boot : spring-boot-starter-webflux:jar:2.7.17

@AbhiramDwivedi Have you checked https://projectreactor.io/docs/netty/release/reference/index.html#faq.connection-closed, especially the part where a Network Component drops a connection silently.
If you have checked that, please provide the TCP dump.
You might be interested in checking this https://medium.com/tenable-techblog/lessons-from-aws-nlb-timeouts-5028a8f65dda (in case you use AWS NLB) and this https://youtu.be/O4oZS-SAq14?t=526

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

AbhiramDwivedi commented

Hi @violetagg: Those links were quite useful in understanding the TCP settings. We tried pretty aggressive settings and that did not help. This is now almost solved with TCP changes on the target cluster, with the following changes:

  • keep-alive reduced from 3000 to 300
  • keep-alive-requests increased from 100 to 1000
  • upstream-keepalive-timeout increased from 60 to 300

However, we still run into it sometimes, and there is no consistent way of reproducing or solving this.

A project worth hundreds of millions was delayed due to this, and is now live with known intermittent issues. All network and dev teams have exhausted their capacity. Sometimes it is OK to move on rather than stay stuck trying to solve it.

For a case like this, or other future cases, I would expect the project developers to provide an option to kill the pool and behave like RestTemplate. We are probably going to make that code change on our end anyway, and use two different ways of invoking endpoints.

This bug is not about "my" issue, but rather about a permanent solution.

@AbhiramDwivedi You changed the timeouts on the target, but did you add any configuration on your client, e.g. maxIdleTime, as suggested in our FAQ?
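
A minimal sketch of that kind of client-side setting; the 240 s value is only an illustrative assumption, chosen to stay below the 300 s upstream keepalive mentioned earlier, and in general it should stay below the idle timeout of every hop between client and server:

import java.time.Duration;

import reactor.netty.resources.ConnectionProvider;

public class ClientIdleTimeConfig {

    // Keep the client-side idle time shorter than the upstream keepalive so the
    // client never reuses a connection the ingress may already have closed.
    static ConnectionProvider alignedProvider() {
        return ConnectionProvider.builder("aligned-pool")
                .maxIdleTime(Duration.ofSeconds(240))       // assumption: comfortably below 300 s
                .evictInBackground(Duration.ofSeconds(60))  // periodic background eviction
                .build();
    }
}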

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open.