reactor / reactor-netty

TCP/HTTP/UDP/QUIC client/server with Reactor over Netty

Home Page: https://projectreactor.io

Max active streams reached exception

nicsam1997 opened this issue · comments

Expected Behavior

When a connection observes an exception, it should stop being used.

Actual Behavior

The bad connection remains open for a long time until it is eventually closed.

Steps to Reproduce

Cannot reproduce, have only seen it in production. Have attached logs below.
logs_reactor_netty_remove_connection.txt
logs_reactor_netty_start_of_issue.txt
logs_reactor_netty_streams_increasing.txt

Your Environment

  • Reactor version(s) used: spring-boot-starter-parent:3.1.5, spring-boot-starter-reactor-netty:3.1.5, reactor-netty-http:1.1.12
  • JVM version (java -version): 17
  • OS and version (eg. uname -a):

@nicsam1997 Are there other logs from reactor.netty.http.client.Http2Pool logger?

@nicsam1997 In the Gitter thread you mentioned ReadTimeoutExceptions. Can you also add the configuration that you use, in order to receive ReadTimeoutExceptions?
https://matrix.to/#/!rdZMIWMDXqzVEagqCt:gitter.im/$wavvZKbYc5hXnehfGjiExvduwSMa746rQH8ONG2QBWQ?via=gitter.im&via=matrix.org&via=mozilla.org

The configuration is like this: we have a responseTimeout of 4s.

// CONNECT_TIMEOUT_MILLIS is a static import of ChannelOption.CONNECT_TIMEOUT_MILLIS
final var httpClient =
    HttpClient.create(providerBuilder.build())
        .protocol(config.http2Enabled() ? HttpProtocol.H2 : HttpProtocol.HTTP11)
        .responseTimeout(config.responseTimeout())
        .option(
            CONNECT_TIMEOUT_MILLIS,
            Long.valueOf(config.connectTimeout().toMillis()).intValue());
httpClient.warmup().block();
final var contextSpec =
    Http2SslContextSpec.forClient()
        .configure(
            builder ->
                configureKeyManager(keyStoreFullPath, mtlsConfig.keyStorePassword(), builder));
return httpClient.secure(t -> t.sslContext(contextSpec));

There are logs for the Http2Pool logger towards the end of the attached file; I have filtered out the logs containing "Channel deactivated" or "Channel activated". There are no other logs for that logger.
logs_rector_netty_include_errors.txt

Hi @violetagg, I have now debugged this issue a bit further and managed to reproduce it. Basically, the ReadTimeoutException is thrown every time because the server never responds fast enough, but this does not lead to the connection being closed. I am not sure whether this is intended, but I would like to customize the behaviour: as soon as I observe a ReadTimeoutException, I want to dispose the connection that was used. Is this possible?
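One possible (untested) sketch of what "dispose on timeout" could look like: the `responseConnection` variant of the response receivers exposes the `Connection` alongside the response, so the error path can call `dispose()`. The URI is a placeholder, and note that for HTTP/2 the exposed `Connection` corresponds to the stream, so disposing it may not close the parent connection.

```java
import java.time.Duration;

import io.netty.handler.timeout.ReadTimeoutException;
import reactor.core.publisher.Mono;
import reactor.netty.http.client.HttpClient;

public class DisposeOnTimeoutSketch {
    public static void main(String[] args) {
        HttpClient client = HttpClient.create()
                .responseTimeout(Duration.ofSeconds(4));

        // responseConnection exposes the Connection, so the error path can
        // dispose it; "https://example.com/resource" is a placeholder URI.
        String body = client.get()
                .uri("https://example.com/resource")
                .responseConnection((res, conn) ->
                        conn.inbound().receive().aggregate().asString()
                            // Hypothetical: force-close the connection/stream
                            // when the response times out.
                            .doOnError(ReadTimeoutException.class, e -> conn.dispose()))
                .single()
                .onErrorResume(e -> Mono.empty())
                .block();
    }
}
```

Whether this removes the underlying HTTP/2 connection from the pool (rather than only the stream) is exactly the question discussed below.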

@nicsam1997 Are you able to share this repro?

Yes, but probably tomorrow; I need to clean it up a bit. I have not managed to reproduce the max active streams error, just seen that the connection does not get disposed despite observing a ReadTimeoutException.

@violetagg Edited above

Hi @violetagg, I have created a reproduction here: https://github.com/nicsam1997/http2reproduce Hopefully it is clear; if something is not working well, please reach out. Note that there is a workaround for my issue, implemented for Linux, using an EpollChannelOption. Basically, the reproduction shows that if the connection is broken (in this case because of a firewall rule being added), it does not get thrown away. If you add the same EpollChannelOption as I did, it does get thrown away, so there is a workaround.

@nicsam1997 IMO what you observe with this reproducible example is a bit different. Let me explain: based only on a response timeout, we cannot consider the connection itself broken (in the HTTP/2 use case, the ReadTimeoutException is received at the level of the stream). Reactor Netty gives you the ability to specify a response timeout per request, which means different requests might have different timeouts (of course, you may also specify one global/default response timeout that applies to all requests, which is the case here). Some calls might reach their timeout if they request slow resources on the server, while others might not if the server is able to respond within the timeout.
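As a sketch of the per-request response timeouts mentioned above (endpoint names and durations are illustrative): `HttpClient` configuration methods return derived immutable instances, so different calls can carry different timeouts while sharing the same underlying resources.

```java
import java.time.Duration;

import reactor.netty.http.HttpProtocol;
import reactor.netty.http.client.HttpClient;

public class PerRequestTimeoutsSketch {
    public static void main(String[] args) {
        // One shared client; configuration methods return derived instances.
        HttpClient base = HttpClient.create().protocol(HttpProtocol.H2C);

        // A slow endpoint gets a generous timeout...
        HttpClient slow = base.responseTimeout(Duration.ofSeconds(10));

        // ...while a latency-sensitive endpoint gets a tight one.
        HttpClient fast = base.responseTimeout(Duration.ofMillis(500));

        // Illustrative usage (not executed here):
        // slow.get().uri("/report").responseContent();
        // fast.get().uri("/health").responseContent();
    }
}
```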

Your use case belongs more to the group of issues where a network component (firewall, load balancer, etc.) silently drops the connection or packets. For these kinds of issues you can enable SO_KEEPALIVE; you can find more about this in the reference documentation:
https://projectreactor.io/docs/netty/release/reference/index.html#faq.connection-closed
https://projectreactor.io/docs/netty/release/reference/index.html#connection-timeout

I added this configuration to your example:

        final var httpClient =
                HttpClient.create(connectionProvider)
                        .protocol(HttpProtocol.H2C)
                        .responseTimeout(Duration.ofSeconds(4))
                        // SO_KEEPALIVE configuration start
                        .option(ChannelOption.SO_KEEPALIVE, true)
                        .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPIDLE), 1)
                        .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPINTERVAL), 1)
                        .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPCOUNT), 8)
                        // SO_KEEPALIVE configuration end
                        .option(CONNECT_TIMEOUT_MILLIS, 50)
                        .option(EpollChannelOption.TCP_USER_TIMEOUT, 3000);

Keep-alive does not help me get rid of a busy connection, right? Ideally the connection would not break, but if it does, would this setting do anything to remove it from the pool (given that I already have traffic on the connection)?

As I wrote, I think the reproducible example is a bit different: there, SO_KEEPALIVE helps because you have a firewall dropping packets.

Okay, I understand! The firewall was simply my way of making the connection go bad; I suppose there could be other ways. In general, I need to figure out how to remove a connection once it has gone bad, for example if I had a sticky load balancer and one slow server on the other side.

Ideally I would want to mark the connection as “not reusable” if it observes a timeout
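There is no "mark as not reusable on timeout" flag shown in this thread, but a partial mitigation is to bound how long any pooled connection may live, so a bad connection is eventually evicted rather than staying in the pool indefinitely. A hedged sketch using `ConnectionProvider` lifetime settings (the pool name and durations below are made-up values to tune for your workload):

```java
import java.time.Duration;

import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class BoundedPoolSketch {
    public static void main(String[] args) {
        // Illustrative values, not recommendations.
        ConnectionProvider provider = ConnectionProvider.builder("bounded")
                .maxConnections(50)
                // Close connections that sit idle longer than 30s.
                .maxIdleTime(Duration.ofSeconds(30))
                // Close connections older than 5 minutes regardless of use,
                // so a "stuck" connection cannot live in the pool forever.
                .maxLifeTime(Duration.ofMinutes(5))
                // Periodically sweep the pool for evictable connections.
                .evictInBackground(Duration.ofSeconds(30))
                .build();

        HttpClient client = HttpClient.create(provider);
    }
}
```

This does not react to a specific ReadTimeoutException, but it caps how long a misbehaving connection can keep being handed out.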

@nicsam1997 Based on the provided logs and a code review, I found one issue (fixed with #3031). Will you be able to test the current snapshots? (1.0.42-SNAPSHOT/1.1.16-SNAPSHOT)

Possibly, but what does this bugfix address? I have not seen the max active streams exception for a long time now, so that will be hard to test.

It tries to fix the reported problem:

Immediately aborted pooled channel, max active streams is reached, re-acquiring a new channel
Error while acquiring from reactor.netty.http.client.Http2Pool@41130dcb. Max active streams is reached.

Since I don’t have any other way to reproduce this than running in production and waiting for a long time, I think testing the snapshot will unfortunately be a bit hard.

@nicsam1997 I'm going to close this issue as I think that #3031 should fix it. The fix will be available in 1.0.42/1.1.16 (the planned date for these releases is 13.02). When the release is available and you are able to upgrade, please test it. If the problem is still there we can reopen this issue.