reactor / reactor-netty

TCP/HTTP/UDP/QUIC client/server with Reactor over Netty

Home Page: https://projectreactor.io

Timeout leaves connection in the pool in configured state, blocking a pool slot

jkippen opened this issue · comments

We have made use of Mono's timeout() to catch TCP connection timeouts (TBH, I'd much rather use the timeout configuration on the TcpClient itself, but I'm not in control of the code).

When the timeout fires, I see in the connection pool that there is one connection in the configured state. When I try to acquire another connection (I defined a pool of size 1), I hit the default 45s acquire timeout. It seems like the connection abort that should occur when the Mono is cancelled doesn't remove the connection from the pool. I'm not sure if the delayElement() I inject between connect() and timeout() causes issues with the cancel and abort, but I've also noticed that the firing of doOnCancel is hit and miss.

A couple of notes:

  1. I added .doOnCancel recently to see if the Mono was being cancelled; the result is the same with and without it.
  2. I initially added the random delay to try to elicit a potential race condition (after the defer I added a number of repeats). Only one run was needed.
  3. The connection is disposed within flatMap, though that doesn't matter here since the timeout fires first.

Expected Behavior

If a connection is emitted from the Mono, it is up to the developer to dispose of the connection. If a connection is never delivered, a slot in the connection pool should not be blocked.

Actual Behavior

In a controlled scenario with a pool of size 1, if the timeout fires some time after connect(), a subsequent acquire times out.

Steps to Reproduce

Here's a code sample:

/* This is the TcpServer I have running for the test client to connect to
 DisposableServer server = TcpServer
            .create()
            .host("localhost")
            .port(5555)
            .metrics(true)
            .observe(new ConnectionObserver() {
                @Override
                public void onStateChange(Connection connection, State newState) {
                    log.info("Server Connection State: " + connection.channel().toString() + " // " + newState + " // " + Thread.currentThread().getName() );                       
                }

                @Override
                public void onUncaughtException(Connection connection, Throwable error) { 
                   log.debug("*********** error" + connection);                 
                }
            })
            .handle( (i, o) -> {
                log.info("******** Server receive " + i);
                Flux<String> payload = i
                                        .receive()
                                        .retain()
                                        .asString()
                                       // .delayElements(Duration.ofMillis(Random.from(RandomGenerator.getDefault()).nextInt(1000, 4000)))
                                        .doOnNext(data -> {
                                            log.info(data);
                                        });
                return o.sendString(payload).then().doOnSuccess(v -> log.info("******** Server send " + i));
            })
            .bindNow();
          server.onDispose().block();  
     */

/// Connection pool of size 1
ConnectionProvider provider = ConnectionProvider.create("test", 1, true);
        
TcpClient client = TcpClient.create(provider) 
                                        .host("localhost")
                                        .port(5555)
                                       // .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 2000)
                                        .observe( new ConnectionObserver() {
                                            @Override
                                            public void onStateChange(Connection connection, State newState) {
                                                log.info("Client Connection State: " + connection.channel().toString() + " // " + newState + " // " + Thread.currentThread().getName() );
                                                //monitor.put(connection.channel().id().toString(), newState.toString());
                                            }

                                            @Override
                                            public void onUncaughtException(Connection connection, Throwable error) {                                        
                                                log.debug("*********** error" + connection);     
                                            }
                                        })
                                    // .wiretap("Client", LogLevel.INFO)
                                    .metrics(true);

/// This connect times out, leaves a connection in configured state
Mono
        .defer(() -> client
                        .connect()
                        .doOnCancel(() -> log.debug("********* cancelling"))
                        .delayElement(Duration.ofMillis(Random.from(RandomGenerator.getDefault()).nextInt(900, 1000)))
                        .timeout(Duration.ofMillis(900))                      
                        .flatMap(connection -> sendAndReceive(connection))
        ).block();

/// This connect cannot acquire a connection
 Mono
        .defer(() -> client
                        .connect()
                        .flatMap(connection -> sendAndReceive(connection))
                        .doOnError(e -> {
                            e.printStackTrace();
                        })     
                        .onErrorComplete()
                )
                .block();
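
The sendAndReceive helper used in both blocks isn't included in the report. Purely for context, a hypothetical version consistent with note 3 above (the connection is disposed within flatMap) could look roughly like this; the payload and names are placeholders, not the reporter's actual code:

// Hypothetical stand-in for the reporter's helper (not from the report):
// send one string, read one response as a String, then dispose the
// connection, matching note 3 above.
Mono<String> sendAndReceive(Connection connection) {
    return connection.outbound()
            .sendString(Mono.just("hello"))              // placeholder payload
            .then()                                      // wait for the write to complete
            .then(connection.inbound()
                    .receive()
                    .asString()
                    .next())                             // take the first response
            .doFinally(signal -> connection.dispose());  // dispose on complete/error/cancel
}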

Client log
(screenshot attached)

Server log
(screenshot attached)

Your Environment

Here's my build.gradle:

plugins {
    id 'org.springframework.boot' version '3.2.0'
    id 'io.spring.dependency-management' version '1.1.4'
    id 'java'
    id("io.freefair.lombok") version "8.6"
}

group = 'com.example'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = '21'

repositories {
    mavenCentral()
}

java {
    toolchain {
        languageVersion = JavaLanguageVersion.of(21)
    }
}

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-webflux'
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    implementation platform('io.projectreactor:reactor-bom:2020.0.9')
    implementation 'jakarta.xml.bind:jakarta.xml.bind-api:3.0.1'
    implementation 'com.sun.xml.bind:jaxb-xjc:3.0.1'
    implementation 'io.micrometer:micrometer-core'
    implementation platform('io.micrometer:micrometer-bom:latest.release')
    implementation 'io.micrometer:micrometer-observation'
    runtimeOnly 'com.sun.xml.bind:jaxb-impl:3.0.1'
    runtimeOnly 'org.glassfish.jaxb:jaxb-runtime:3.0.1'
    implementation 'io.projectreactor:reactor-core'
    implementation 'io.projectreactor:reactor-core-micrometer'
    implementation 'io.projectreactor.tools:blockhound:1.0.8.RELEASE'
    //implementation 'io.micrometer:micrometer-registry-atlas:latest.release'
    testImplementation('org.springframework.boot:spring-boot-starter-test')
}

  • OS and version (eg. uname -a): 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:28:58 PST 2023; root:xnu-10002.81.5~7/RELEASE_X86_64 x86_64

@jkippen With this reproducible example, I'm seeing that the connection is always emitted, but your pipeline delays it intentionally. If I add a log() like below, I can see that the connection was delivered.

Mono
				.defer(() -> client
						.connect()
						.log()
						.doOnCancel(() -> log.debug("********* cancelling"))
...

Since the connection is delivered, you have to guarantee that it is disposed regardless of whether the flow ends in cancellation, error, or normal completion.

The logs:

10:01:02.562 [reactor-tcp-nio-3] DEBUG r.n.r.DefaultPooledConnectionProvider - [054ae908, L:/127.0.0.1:50560 - R:localhost/127.0.0.1:5555] onStateChange(PooledConnection{channel=[id: 0x054ae908, L:/127.0.0.1:50560 - R:localhost/127.0.0.1:5555]}, [connected])
10:01:02.562 [reactor-tcp-nio-3] INFO  reactor.netty.tcp.TcpServerTests - Client Connection State: [id: 0x054ae908, L:/127.0.0.1:50560 - R:localhost/127.0.0.1:5555] // [connected] // reactor-tcp-nio-3
10:01:02.569 [reactor-tcp-nio-3] DEBUG r.n.r.DefaultPooledConnectionProvider - [054ae908, L:/127.0.0.1:50560 - R:localhost/127.0.0.1:5555] onStateChange(ChannelOperations{PooledConnection{channel=[id: 0x054ae908, L:/127.0.0.1:50560 - R:localhost/127.0.0.1:5555]}}, [configured])
10:01:02.570 [reactor-tcp-nio-4] DEBUG reactor.netty.tcp.TcpServer - [3d8e3123, L:/127.0.0.1:5555 - R:/127.0.0.1:50560] Handler is being applied: reactor.netty.tcp.TcpServerTests$$Lambda$407/1621513804@55d2037f
10:01:02.570 [reactor-tcp-nio-3] INFO  reactor.Mono.Create.2 - onNext(ChannelOperations{PooledConnection{channel=[id: 0x054ae908, L:/127.0.0.1:50560 - R:localhost/127.0.0.1:5555]}})
10:01:02.570 [reactor-tcp-nio-4] INFO  reactor.netty.tcp.TcpServerTests - ******** Server receive ChannelOperations{SimpleConnection{channel=[id: 0x3d8e3123, L:/127.0.0.1:5555 - R:/127.0.0.1:50560]}}
10:01:02.570 [reactor-tcp-nio-3] INFO  reactor.Mono.Create.2 - onComplete()
10:01:02.570 [reactor-tcp-nio-3] INFO  reactor.netty.tcp.TcpServerTests - Client Connection State: [id: 0x054ae908, L:/127.0.0.1:50560 - R:localhost/127.0.0.1:5555] // [configured] // reactor-tcp-nio-3

I may not have explained the issue correctly - I intentionally add a delay to simulate the timeout firing before the connection is delivered downstream, to see if there is a race condition between connect() delivering a connection, the timeout firing, and the connection not being aborted.

Even when the doOnCancel fires, I still see this connection hanging around in the configured state, taking up a slot in the connection pool.

I only added the doOnCancel for some observability into the abort of the connection.

@jkippen With the current example I cannot reproduce what you are explaining. It seems that you can, so please provide logs from an execution where you see the behaviour you are describing.

I added a couple of screenshots of my client and server logs above - are they visible?

Here's a smaller snippet of code if it helps. In the client log I attached above, the first exception is the timeout() firing, and the second is the acquire timeout from the second connect() attempt.

ConnectionProvider provider = ConnectionProvider.create("test", 1, true);

/// This connect times out, leaves a connection in configured state
Mono
        .defer(() -> client
                        .connect()
                        .delayElement(Duration.ofMillis(1000))
                        .timeout(Duration.ofMillis(900))                      
                        .flatMap(connection -> sendAndReceive(connection))
        ).block();

/// This connect cannot acquire a connection
 Mono
        .defer(() -> client
                        .connect()
                        .flatMap(connection -> sendAndReceive(connection))
                        .doOnError(e -> {
                            e.printStackTrace();
                        })     
                        .onErrorComplete()
                )
                .block();

Oh, I see what you're saying - I removed some stuff to share cleaned-up code... will share again in a moment.

I have a feeling this issue is very similar to this one from reactor-pool. Please have a look at my comment there; perhaps using .doOnDiscard() you will be able to fix it.
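
Purely as an illustration of that suggestion, here is a hedged sketch of the reported pipeline with a discard hook added at the end of the chain; whether the discarded Connection actually reaches this hook in this exact scenario is what would need verifying:

// Sketch only: register a discard hook so that a Connection dropped by the
// delayElement()/timeout() cancellation is disposed instead of being left
// in the pool in the configured state.
Mono
        .defer(() -> client
                        .connect()
                        .delayElement(Duration.ofMillis(1000))
                        .timeout(Duration.ofMillis(900))
                        .flatMap(connection -> sendAndReceive(connection))
                        .doOnDiscard(Connection.class, Connection::dispose)
                        .onErrorComplete()
        ).block();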

I want to see the logs from connect() to know whether we delivered the connection or not, so @jkippen please add .log() after the connect(). I need to know whether the cancellation happens while the connection is still not delivered, or after that.

Okay, here's the code for the first connect() attempt that times out (I had removed onErrorComplete) - I will add log() and send the output shortly.

// timeout via reactive
Mono
        .defer(() -> client
                        .connect()
                        .doOnCancel(() -> log.debug("********* cancelling"))
                        .delayElement(Duration.ofMillis(Random.from(RandomGenerator.getDefault()).nextInt(900, 1000)))
                        .timeout(Duration.ofMillis(9))                      
                        .flatMap(connection -> sendAndReceive(connection))
                        .doOnError(e -> {
                            e.printStackTrace();
                        })
                        .onErrorComplete()
        ).block();

Here's the log with log() output

(screenshot attached)

@chemicL to be honest, I'm assuming that using the connect timeout on the TcpClient itself would avoid the problem (though I'm not sure, and not sure how to validate that). Even so (and I don't know much about Reactor), if the responsibility of timeout(..) is to cancel the mono, shouldn't it abort the connection regardless of whether it was emitted from connect() or not?

In my example, if you change the timeout to something closer to the delay (i.e. 900ms), the doOnCancel also doesn't run. In the output of my last test, doOnCancel did run (the timeout was 9ms).

Here's the log output from a run where doOnCancel doesn't run, in case it's helpful:

(screenshot attached)

It looks like with 9ms you hit a race where the cancellation reaches your doOnCancel but at the same time the connection is being delivered. That's my expectation, as the logs show the connection was established. The connect timeout would apply if the connection failed to be established, but here it succeeds and is delivered to your delayElement() operator. That operator keeps a reference to the connection. When the delay is longer than the argument to timeout(), the scheduled task that holds the reference to the connection is disposed and the reference is discarded. If there is no discard hook defined on the reactive pipeline, the reference is simply abandoned and the connection is not returned to the pool. That's my understanding of the situation. reactor-netty's timeout configuration has no insight into the pipeline that you attach to the result and can't control your code that holds a reference to the connection.

Just trying to help - please check if adding the discard hook that releases the connection helps.

@jkippen One additional question - are you testing with this ancient version -> implementation platform('io.projectreactor:reactor-bom:2020.0.9') or with the latest releases?

Ah - I was using the ancient version. However, I updated my Gradle dependency to io.projectreactor:reactor-bom:2023.0.4, cleaned my workspace, broke my workspace, fixed it, and retested. I got the same result.

@jkippen I agree with the comment made by @chemicL. You need to handle cancellation and error events. From the Reactor Netty point of view, it is dangerous to handle these events and just close the connection: we don't know what kind of protocol is implemented and whether that protocol needs to do some cleanup in these situations.
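
As an editorial aside (not a suggestion made in the thread), one pattern that expresses the "dispose on cancellation, error and normal flow" requirement is to treat the connection as a resource with Mono.usingWhen. A minimal sketch, assuming the same client and sendAndReceive helper as above:

// Sketch: acquire the connection as a resource so that the cleanup function
// runs on completion, error and cancellation alike. Any protocol-level
// cleanup mentioned above would go into the cleanup function as well.
Mono.usingWhen(
        client.connect(),                                      // acquire from the pool
        connection -> sendAndReceive(connection)
                          .timeout(Duration.ofMillis(900)),    // use, with the timeout inside
        connection -> Mono.fromRunnable(connection::dispose)   // release
).block();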