reactor / reactor-netty

TCP/HTTP/UDP/QUIC client/server with Reactor over Netty

Home Page: https://projectreactor.io

Max active streams reached exception

nicsam1997 opened this issue · comments

Expected Behavior

When a connection observes an exception, it should stop being used.

Actual Behavior

The bad connection remains open for a long time until it is eventually closed.

Steps to Reproduce

Cannot reproduce, have only seen it in production. Have attached logs below.
logs_reactor_netty_remove_connection.txt
logs_reactor_netty_start_of_issue.txt
logs_reactor_netty_streams_increasing.txt

Your Environment

  • Reactor version(s) used: spring-boot-starter-parent:3.1.5, spring-boot-starter-reactor-netty:3.1.5, reactor-netty-http:1.1.12
  • JVM version (java -version): 17
  • OS and version (eg. uname -a):

@nicsam1997 Are there other logs from reactor.netty.http.client.Http2Pool logger?

@nicsam1997 In the Gitter thread you mentioned ReadTimeoutExceptions. Can you also add the configuration that you use, in order to receive ReadTimeoutExceptions?
https://matrix.to/#/!rdZMIWMDXqzVEagqCt:gitter.im/$wavvZKbYc5hXnehfGjiExvduwSMa746rQH8ONG2QBWQ?via=gitter.im&via=matrix.org&via=mozilla.org

The configuration is like this: we have a responseTimeout of 4s.

// CONNECT_TIMEOUT_MILLIS is a static import of ChannelOption.CONNECT_TIMEOUT_MILLIS
final var httpClient =
    HttpClient.create(providerBuilder.build())
        .protocol(config.http2Enabled() ? HttpProtocol.H2 : HttpProtocol.HTTP11)
        .responseTimeout(config.responseTimeout())
        .option(
            CONNECT_TIMEOUT_MILLIS,
            Long.valueOf(config.connectTimeout().toMillis()).intValue());
httpClient.warmup().block();
final var contextSpec =
    Http2SslContextSpec.forClient()
        .configure(
            builder ->
                configureKeyManager(keyStoreFullPath, mtlsConfig.keyStorePassword(), builder));
return httpClient.secure(t -> t.sslContext(contextSpec));

There are logs for the Http2Pool logger towards the end of the attached file; I have filtered out the logs containing "Channel deactivated" or "Channel activated". There are no other logs for that logger.
logs_rector_netty_include_errors.txt

Hi @violetagg, I have now debugged this issue a bit further and managed to reproduce it. Basically, the ReadTimeoutException is thrown every time because the server never responds fast enough, but this does not lead to the connection being closed. I am not sure whether this is intended, but I would like to customize the behaviour: as soon as I observe a ReadTimeoutException, I want to dispose the connection that was used. Is this possible?
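One possible (untested) sketch of what "dispose on timeout" could look like: the `responseConnection` variant of the response receivers exposes the `Connection` alongside the response, so the error path can call `dispose()`. The URI is a placeholder, and note that for HTTP/2 the exposed `Connection` corresponds to the stream, so disposing it may not close the parent connection.

```java
import java.time.Duration;

import io.netty.handler.timeout.ReadTimeoutException;
import reactor.core.publisher.Mono;
import reactor.netty.http.client.HttpClient;

public class DisposeOnTimeoutSketch {
    public static void main(String[] args) {
        HttpClient client = HttpClient.create()
                .responseTimeout(Duration.ofSeconds(4));

        // responseConnection exposes the Connection, so the error path can
        // dispose it; "https://example.com/resource" is a placeholder URI.
        String body = client.get()
                .uri("https://example.com/resource")
                .responseConnection((res, conn) ->
                        conn.inbound().receive().aggregate().asString()
                            // Hypothetical: force-close the connection/stream
                            // when the response times out.
                            .doOnError(ReadTimeoutException.class, e -> conn.dispose()))
                .single()
                .onErrorResume(e -> Mono.empty())
                .block();
    }
}
```

Whether this removes the underlying HTTP/2 connection from the pool (rather than only the stream) is exactly the question discussed below.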

@nicsam1997 Are you able to share this repro?

Yes, but probably tomorrow; I need to clean it up a bit. I have not managed to reproduce the max active streams error, just seen that the connection does not get disposed despite observing a ReadTimeoutException.

@violetagg Edited above

Hi @violetagg, I have created a reproduction here: https://github.com/nicsam1997/http2reproduce Hopefully it is clear; if something is not working well, please reach out. Note that there is a workaround for my issue, implemented for Linux, using an EpollChannelOption. Basically, the reproduction shows that if the connection is broken (in this case because of a firewall rule being added), it does not get thrown away. If you add the same EpollChannelOption as I did, it does get thrown away, so there is a workaround.

@nicsam1997 IMO what you observe with this reproducible example is a bit different. Let me explain: based only on a response timeout, we cannot consider the connection itself broken (in the HTTP/2 use case, the ReadTimeoutException is received at the level of the stream). Reactor Netty gives you the ability to specify a response timeout per request, which means different requests might have different timeouts (of course, you may also specify one global/default response timeout that applies to all requests, which is the case here). Some calls might reach their timeout if they request slow resources on the server, while others might not if the server is able to respond within the timeout.
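As a sketch of the per-request response timeouts mentioned above (endpoint names and durations are illustrative): `HttpClient` configuration methods return derived immutable instances, so different calls can carry different timeouts while sharing the same underlying resources.

```java
import java.time.Duration;

import reactor.netty.http.HttpProtocol;
import reactor.netty.http.client.HttpClient;

public class PerRequestTimeoutsSketch {
    public static void main(String[] args) {
        // One shared client; configuration methods return derived instances.
        HttpClient base = HttpClient.create().protocol(HttpProtocol.H2C);

        // A slow endpoint gets a generous timeout...
        HttpClient slow = base.responseTimeout(Duration.ofSeconds(10));

        // ...while a latency-sensitive endpoint gets a tight one.
        HttpClient fast = base.responseTimeout(Duration.ofMillis(500));

        // Illustrative usage (not executed here):
        // slow.get().uri("/report").responseContent();
        // fast.get().uri("/health").responseContent();
    }
}
```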

Your use case belongs more to the group of issues where a network component (firewall, load balancer, etc.) silently drops the connection or packets. For these kinds of issues you can enable SO_KEEPALIVE; you can find more about this in the reference documentation:
https://projectreactor.io/docs/netty/release/reference/index.html#faq.connection-closed
https://projectreactor.io/docs/netty/release/reference/index.html#connection-timeout

I added this configuration to your example:

        final var httpClient =
                HttpClient.create(connectionProvider)
                        .protocol(HttpProtocol.H2C)
                        .responseTimeout(Duration.ofSeconds(4))
                        // SO_KEEPALIVE configuration start
                        .option(ChannelOption.SO_KEEPALIVE, true)
                        .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPIDLE), 1)
                        .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPINTERVAL), 1)
                        .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPCOUNT), 8)
                        // SO_KEEPALIVE configuration end
                        .option(CONNECT_TIMEOUT_MILLIS, 50)
                        .option(EpollChannelOption.TCP_USER_TIMEOUT, 3000);

Keep-alive does not help me get rid of a busy connection, right? Ideally the connection would not break, but if it does, would this setting do anything to remove it from the pool (given that I already have traffic on the connection)?

As I wrote, I think the reproducible example is a bit different: there, SO_KEEPALIVE helps because you have a firewall dropping packets.

Okay, I understand! The firewall was simply my way of making the connection go bad; I suppose there could be other ways. In general, I need to figure out how to remove a connection once it has gone bad, for example if I had a sticky load balancer and one slow server on the other side.

Ideally I would want to mark the connection as “not reusable” if it observes a timeout
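There is no "mark as not reusable on timeout" flag shown in this thread, but a partial mitigation is to bound how long any pooled connection may live, so a bad connection is eventually evicted rather than staying in the pool indefinitely. A hedged sketch using `ConnectionProvider` lifetime settings (the pool name and durations below are made-up values to tune for your workload):

```java
import java.time.Duration;

import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class BoundedPoolSketch {
    public static void main(String[] args) {
        // Illustrative values, not recommendations.
        ConnectionProvider provider = ConnectionProvider.builder("bounded")
                .maxConnections(50)
                // Close connections that sit idle longer than 30s.
                .maxIdleTime(Duration.ofSeconds(30))
                // Close connections older than 5 minutes regardless of use,
                // so a "stuck" connection cannot live in the pool forever.
                .maxLifeTime(Duration.ofMinutes(5))
                // Periodically sweep the pool for evictable connections.
                .evictInBackground(Duration.ofSeconds(30))
                .build();

        HttpClient client = HttpClient.create(provider);
    }
}
```

This does not react to a specific ReadTimeoutException, but it caps how long a misbehaving connection can keep being handed out.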

@nicsam1997 Based on the provided logs and a code review, I found one issue (fixed with #3031). Will you be able to test the current snapshots? (1.0.42-SNAPSHOT/1.1.16-SNAPSHOT)

Possibly, but what does this bugfix address? I have not seen the max active streams exception for a long time now, so that will be hard to test.

It tries to fix the reported problem:

Immediately aborted pooled channel, max active streams is reached, re-acquiring a new channel
Error while acquiring from reactor.netty.http.client.Http2Pool@41130dcb. Max active streams is reached.

Since I don’t have any other way to reproduce this than running in production and waiting for a long time, I think testing the snapshot will unfortunately be a bit hard.

@nicsam1997 I'm going to close this issue as I think that #3031 should fix it. The fix will be available in 1.0.42/1.1.16 (the planned date for these releases is 13.02). When the release is available and you are able to upgrade, please test it. If the problem is still there we can reopen this issue.