reactor / reactor-netty

TCP/HTTP/UDP/QUIC client/server with Reactor over Netty

Home Page: https://projectreactor.io

Race condition in HTTP Client

yurybubnov opened this issue

Under load, HTTP client reuses closed connections from the pool.

I have been investigating reactor.netty.channel.AbortedException: Connection has been closed BEFORE send operation errors for a while, and I now have a tcpdump and logs for the same connection.
The connection was opened successfully, used, and then closed by the server, and the close was acknowledged by the client.

71896	2023-09-20 18:39:11.072236	10.15.12.148	10.15.31.167	TCP	74	54978 → 443 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM TSval=4020257053 TSecr=0 WS=128
71917	2023-09-20 18:39:11.073177	10.15.31.167	10.15.12.148	TCP	74	443 → 54978 [SYN, ACK] Seq=0 Ack=1 Win=26847 Len=0 MSS=8961 SACK_PERM TSval=3652594669 TSecr=4020257053 WS=256
71918	2023-09-20 18:39:11.073187	10.15.12.148	10.15.31.167	TCP	66	54978 → 443 [ACK] Seq=1 Ack=1 Win=27008 Len=0 TSval=4020257054 TSecr=3652594669
71933	2023-09-20 18:39:11.073747	10.15.12.148	10.15.31.167	TLSv1.2	583	Client Hello
71957	2023-09-20 18:39:11.074671	10.15.31.167	10.15.12.148	TCP	66	443 → 54978 [ACK] Seq=1 Ack=518 Win=28160 Len=0 TSval=3652594671 TSecr=4020257054
71984	2023-09-20 18:39:11.075574	10.15.31.167	10.15.12.148	TLSv1.2	5509	Server Hello, Certificate, Server Key Exchange, Server Hello Done
71986	2023-09-20 18:39:11.075581	10.15.12.148	10.15.31.167	TCP	66	54978 → 443 [ACK] Seq=518 Ack=5444 Win=44800 Len=0 TSval=4020257056 TSecr=3652594672
72041	2023-09-20 18:39:11.079305	10.15.12.148	10.15.31.167	TLSv1.2	192	Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
72059	2023-09-20 18:39:11.080312	10.15.31.167	10.15.12.148	TCP	66	443 → 54978 [ACK] Seq=5444 Ack=644 Win=28160 Len=0 TSval=3652594677 TSecr=4020257060
72060	2023-09-20 18:39:11.080361	10.15.31.167	10.15.12.148	TLSv1.2	237	New Session Ticket, Change Cipher Spec, Encrypted Handshake Message
72061	2023-09-20 18:39:11.080438	10.15.12.148	10.15.31.167	TLSv1.2	727	Application Data
72100	2023-09-20 18:39:11.083014	10.15.31.167	10.15.12.148	TLSv1.2	257	Application Data
72102	2023-09-20 18:39:11.083212	10.15.31.167	10.15.12.148	TLSv1.2	100	Application Data
72113	2023-09-20 18:39:11.083969	10.15.12.148	10.15.31.167	TCP	66	54978 → 443 [ACK] Seq=1305 Ack=5840 Win=66560 Len=0 TSval=4020257065 TSecr=3652594679
........
bunch of data flowing here
.........
513450	2023-09-20 18:40:11.687199	10.15.12.148	10.15.31.167	TCP	66	54978 → 443 [ACK] Seq=2306299 Ack=854828 Win=108544 Len=0 TSval=4020317669 TSecr=3652655271
513551	2023-09-20 18:40:11.701571	10.15.12.148	10.15.31.167	TLSv1.2	727	Application Data
513573	2023-09-20 18:40:11.703679	10.15.31.167	10.15.12.148	TLSv1.2	257	Application Data
513576	2023-09-20 18:40:11.703710	10.15.31.167	10.15.12.148	TLSv1.2	100	Application Data
513706	2023-09-20 18:40:11.712073	10.15.12.148	10.15.31.167	TCP	66	54978 → 443 [ACK] Seq=2306960 Ack=855053 Win=108544 Len=0 TSval=4020317693 TSecr=3652655300
513832	2023-09-20 18:40:11.720647	10.15.12.148	10.15.31.167	TLSv1.2	97	Encrypted Alert
513833	2023-09-20 18:40:11.720670	10.15.12.148	10.15.31.167	TCP	66	54978 → 443 [FIN, ACK] Seq=2306991 Ack=855053 Win=108544 Len=0 TSval=4020317702 TSecr=3652655300
513842	2023-09-20 18:40:11.721486	10.15.31.167	10.15.12.148	TCP	66	443 → 54978 [FIN, ACK] Seq=855053 Ack=2306992 Win=108800 Len=0 TSval=3652655317 TSecr=4020317702
513843	2023-09-20 18:40:11.721496	10.15.12.148	10.15.31.167	TCP	66	54978 → 443 [ACK] Seq=2306992 Ack=855054 Win=108544 Len=0 TSval=4020317703 TSecr=3652655317

Then, Netty tried to use the connection and, obviously, failed.

timestamp: "09/20/2023 6:40:11.722 PM -0700",
logger: "reactor.netty.http.client.HttpClientConnect",
message: "[48ed1f67-3507, L:/10.15.12.148:54978 ! R:access-manager.private.site.com/10.15.31.167:443] The connection observed an error, the request cannot be retried as the headers/body were sent",
context: "default",
exception: "reactor.netty.channel.AbortedException: Connection has been closed BEFORE send operation
	at reactor.netty.channel.AbortedException.beforeSend(AbortedException.java:59)
	at reactor.netty.http.client.HttpClientOperations.onInboundClose(HttpClientOperations.java:295)
	at reactor.netty.channel.ChannelOperationsHandler.channelInactive(ChannelOperationsHandler.java:73)

Expected Behavior

Closed connections should not be allocated for requests.

Actual Behavior

The client uses a closed connection.

Steps to Reproduce

We experience this under load and only for HTTPS (not HTTP) connections, for some reason.

Your Environment

Java 17
Spring Cloud Gateway 4.0.7
Reactor Netty 1.1.10
Ubuntu container on AWS EC2 host

Another example. Here is the full tcpdump:

513600	2023-09-20 18:40:11.704839	10.15.12.148	10.15.40.134	TCP	74	33460 → 443 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM TSval=233519926 TSecr=0 WS=128
513609	2023-09-20 18:40:11.705428	10.15.40.134	10.15.12.148	TCP	74	443 → 33460 [SYN, ACK] Seq=0 Ack=1 Win=26847 Len=0 MSS=8961 SACK_PERM TSval=2639073627 TSecr=233519926 WS=256
513610	2023-09-20 18:40:11.705441	10.15.12.148	10.15.40.134	TCP	66	33460 → 443 [ACK] Seq=1 Ack=1 Win=27008 Len=0 TSval=233519927 TSecr=2639073627
513620	2023-09-20 18:40:11.706252	10.15.12.148	10.15.40.134	TLSv1.2	583	Client Hello
513632	2023-09-20 18:40:11.706824	10.15.40.134	10.15.12.148	TCP	66	443 → 33460 [ACK] Seq=1 Ack=518 Win=28160 Len=0 TSval=2639073629 TSecr=233519928
513651	2023-09-20 18:40:11.708261	10.15.40.134	10.15.12.148	TLSv1.2	5509	Server Hello, Certificate, Server Key Exchange, Server Hello Done
513652	2023-09-20 18:40:11.708274	10.15.12.148	10.15.40.134	TCP	66	33460 → 443 [ACK] Seq=518 Ack=5444 Win=44800 Len=0 TSval=233519930 TSecr=2639073630
513773	2023-09-20 18:40:11.716205	10.15.12.148	10.15.40.134	TLSv1.2	192	Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
513785	2023-09-20 18:40:11.716866	10.15.40.134	10.15.12.148	TCP	66	443 → 33460 [ACK] Seq=5444 Ack=644 Win=28160 Len=0 TSval=2639073639 TSecr=233519938
513786	2023-09-20 18:40:11.716893	10.15.40.134	10.15.12.148	TLSv1.2	237	New Session Ticket, Change Cipher Spec, Encrypted Handshake Message
513845	2023-09-20 18:40:11.722007	10.15.12.148	10.15.40.134	TLSv1.2	729	Application Data
513872	2023-09-20 18:40:11.723679	10.15.40.134	10.15.12.148	TLSv1.2	257	Application Data
513875	2023-09-20 18:40:11.723736	10.15.40.134	10.15.12.148	TLSv1.2	100	Application Data
513940	2023-09-20 18:40:11.731760	10.15.12.148	10.15.40.134	TCP	66	33460 → 443 [ACK] Seq=1307 Ack=5840 Win=66560 Len=0 TSval=233519953 TSecr=2639073646
514028	2023-09-20 18:40:11.739822	10.15.12.148	10.15.40.134	TLSv1.2	97	Encrypted Alert
514029	2023-09-20 18:40:11.739838	10.15.12.148	10.15.40.134	TCP	66	33460 → 443 [FIN, ACK] Seq=1338 Ack=5840 Win=66560 Len=0 TSval=233519961 TSecr=2639073646
514032	2023-09-20 18:40:11.740432	10.15.40.134	10.15.12.148	TCP	66	443 → 33460 [FIN, ACK] Seq=5840 Ack=1339 Win=29440 Len=0 TSval=2639073662 TSecr=233519961
514033	2023-09-20 18:40:11.740448	10.15.12.148	10.15.40.134	TCP	66	33460 → 443 [ACK] Seq=1339 Ack=5841 Win=66560 Len=0 TSval=233519962 TSecr=2639073662

Corresponding error:

timestamp: 09/20/2023 6:40:11.739 PM -0700
message: "[84676159-2, L:/10.15.12.148:33460 ! R:access-manager.private.site.com/10.15.40.134:443] The connection observed an error, the request cannot be retried as the headers/body were sent",
context: "default",
exception :"reactor.netty.channel.AbortedException: Connection has been closed BEFORE send operation
	at reactor.netty.channel.AbortedException.beforeSend(AbortedException.java:59)

For some reason, the Netty client sends an Encrypted Alert:

Transport Layer Security
    TLSv1.2 Record Layer: Encrypted Alert
        Content Type: Alert (21)
        Version: TLS 1.2 (0x0303)
        Length: 26
        Alert Message: Encrypted Alert

Hi @yurybubnov,

Connection has been closed BEFORE send operation means that the connection was obtained from the pool and was alive at that point, but before the request was sent, the remote peer closed the connection (or the connection was closed for another reason).
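The window described above can be sketched with a toy model in plain Java. This is not Reactor Netty code: ToyConnection, ToyPool, and run are hypothetical names made up for illustration, and the pool here only checks liveness at acquire time, which is an assumption of the sketch rather than a statement about how the real pool works.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class AcquireThenCloseSketch {

    /** Hypothetical stand-in for a pooled channel; not a Reactor Netty type. */
    static final class ToyConnection {
        private volatile boolean open = true;

        boolean isOpen() { return open; }

        /** Models the connection going inactive (for example after a FIN). */
        void close() { open = false; }

        String send(String request) {
            if (!open) {
                throw new IllegalStateException("Connection has been closed BEFORE send operation");
            }
            return "sent: " + request;
        }
    }

    /** Hypothetical pool that validates liveness only when a connection is acquired. */
    static final class ToyPool {
        private final Deque<ToyConnection> idle = new ArrayDeque<>();

        void release(ToyConnection c) { idle.push(c); }

        ToyConnection acquire() {
            // Connections already known to be closed are skipped here.
            while (!idle.isEmpty()) {
                ToyConnection c = idle.pop();
                if (c.isOpen()) {
                    return c;
                }
            }
            return new ToyConnection();
        }
    }

    /** Runs acquire, an optional close inside the window, then send. */
    static String run(boolean closeBetweenAcquireAndSend) {
        ToyPool pool = new ToyPool();
        pool.release(new ToyConnection());
        ToyConnection c = pool.acquire();   // the liveness check passes here
        if (closeBetweenAcquireAndSend) {
            c.close();                      // the close lands after acquire, before send
        }
        try {
            return c.send("GET /");
        } catch (IllegalStateException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // sent: GET /
        System.out.println(run(true));  // error: Connection has been closed BEFORE send operation
    }
}
```

In this model the pool did nothing wrong at acquire time. The failure only appears because the close happens between the check and the send, which is why the error alone does not show that an already-closed connection was handed out.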

Now, from the tcpdump, can you confirm that the gateway is running on 10.15.12.148 and the destination server is 10.15.40.134?

If so, the Encrypted Alert and the first FIN seem to be sent by the gateway, not by the server, or am I missing something?

513620	2023-09-20 18:40:11.706252	10.15.12.148	10.15.40.134	TLSv1.2	583	Client Hello
...
513651	2023-09-20 18:40:11.708261	10.15.40.134	10.15.12.148	TLSv1.2	5509	Server Hello, Certificate, Server Key Exchange, Server Hello Done
...
514028	2023-09-20 18:40:11.739822	10.15.12.148	10.15.40.134	TLSv1.2	97	Encrypted Alert
514029	2023-09-20 18:40:11.739838	10.15.12.148	10.15.40.134	TCP	66	33460 → 443 [FIN, ACK] Seq=1338 Ack=5840 Win=66560 Len=0 TSval=233519961 TSecr=2639073646
514032	2023-09-20 18:40:11.740432	10.15.40.134	10.15.12.148	TCP	66	443 → 33460 [FIN, ACK] Seq=5840 Ack=1339 Win=29440 Len=0 TSval=2639073662 TSecr=233519961
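One way to check which side closes first, without reading the dump by eye, is a small helper like the sketch below. It assumes the dump is a tab-separated text export with the columns frame, timestamp, source, destination, protocol, length, info (as the excerpts in this thread appear to be); firstFinSender and SAMPLE are hypothetical names introduced for illustration.

```java
import java.util.List;
import java.util.Optional;

class FirstFinSketch {

    // Three lines copied from the excerpt above; \u2192 is the arrow character.
    static final List<String> SAMPLE = List.of(
        "514028\t2023-09-20 18:40:11.739822\t10.15.12.148\t10.15.40.134\tTLSv1.2\t97\tEncrypted Alert",
        "514029\t2023-09-20 18:40:11.739838\t10.15.12.148\t10.15.40.134\tTCP\t66\t"
            + "33460 \u2192 443 [FIN, ACK] Seq=1338 Ack=5840 Win=66560 Len=0 TSval=233519961 TSecr=2639073646",
        "514032\t2023-09-20 18:40:11.740432\t10.15.40.134\t10.15.12.148\tTCP\t66\t"
            + "443 \u2192 33460 [FIN, ACK] Seq=5840 Ack=1339 Win=29440 Len=0 TSval=2639073662 TSecr=233519961"
    );

    /** Returns the source address of the first packet whose info column carries a FIN flag. */
    static Optional<String> firstFinSender(List<String> lines) {
        for (String line : lines) {
            // Assumed columns: frame, timestamp, source, destination, protocol, length, info
            String[] cols = line.split("\t", 7);
            if (cols.length == 7 && cols[6].contains("[FIN")) {
                return Optional.of(cols[2]);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstFinSender(SAMPLE).orElse("no FIN seen")); // 10.15.12.148
    }
}
```

For the three lines above, the first FIN comes from 10.15.12.148, the address later confirmed in the thread to be the gateway.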

Then, can you check whether you see any exceptions in the gateway just before the Encrypted Alert is sent?

Let me know.
Thanks.

Hello @pderop,
You're right, the gateway runs on 10.15.12.148.
There are no errors or warnings other than Connection has been closed BEFORE send operation.
I was able to reproduce it with non-SSL connections. In this example the gateway runs on 10.15.1.82. As you can see, it opens the connection, sends three requests, and closes it.
Here is an example tcpdump:

377082	2023-09-21 14:21:19.293293	10.15.1.82	10.15.11.231	TCP	74	47598 → 80 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM TSval=2656697011 TSecr=0 WS=128
377111	2023-09-21 14:21:19.293638	10.15.11.231	10.15.1.82	TCP	74	80 → 47598 [SYN, ACK] Seq=0 Ack=1 Win=26847 Len=0 MSS=8961 SACK_PERM TSval=2888153627 TSecr=2656697011 WS=256
377113	2023-09-21 14:21:19.293647	10.15.1.82	10.15.11.231	TCP	66	47598 → 80 [ACK] Seq=1 Ack=1 Win=27008 Len=0 TSval=2656697011 TSecr=2888153627
377294	2023-09-21 14:21:19.299810	10.15.1.82	10.15.11.231	HTTP	700	GET /get/1959559705 HTTP/1.1 
377303	2023-09-21 14:21:19.299996	10.15.11.231	10.15.1.82	TCP	66	80 → 47598 [ACK] Seq=1 Ack=635 Win=28160 Len=0 TSval=2888153633 TSecr=2656697017
377349	2023-09-21 14:21:19.301385	10.15.11.231	10.15.1.82	TCP	228	80 → 47598 [PSH, ACK] Seq=1 Ack=635 Win=28160 Len=162 TSval=2888153635 TSecr=2656697017 [TCP segment of a reassembled PDU]
377350	2023-09-21 14:21:19.301395	10.15.1.82	10.15.11.231	TCP	66	47598 → 80 [ACK] Seq=635 Ack=163 Win=28032 Len=0 TSval=2656697019 TSecr=2888153635
377354	2023-09-21 14:21:19.301447	10.15.11.231	10.15.1.82	HTTP/JSON	71	HTTP/1.1 200 OK , JavaScript Object Notation (application/json)
377355	2023-09-21 14:21:19.301453	10.15.1.82	10.15.11.231	TCP	66	47598 → 80 [ACK] Seq=635 Ack=168 Win=28032 Len=0 TSval=2656697019 TSecr=2888153635
377648	2023-09-21 14:21:19.321272	10.15.1.82	10.15.11.231	HTTP	699	GET /get/1868857449 HTTP/1.1 
377688	2023-09-21 14:21:19.323131	10.15.11.231	10.15.1.82	TCP	228	80 → 47598 [PSH, ACK] Seq=168 Ack=1268 Win=29440 Len=162 TSval=2888153657 TSecr=2656697039 [TCP segment of a reassembled PDU]
377691	2023-09-21 14:21:19.323164	10.15.11.231	10.15.1.82	HTTP/JSON	71	HTTP/1.1 200 OK , JavaScript Object Notation (application/json)
377729	2023-09-21 14:21:19.325043	10.15.1.82	10.15.11.231	TCP	66	47598 → 80 [ACK] Seq=1268 Ack=335 Win=29056 Len=0 TSval=2656697042 TSecr=2888153657
377834	2023-09-21 14:21:19.328871	10.15.1.82	10.15.11.231	HTTP	700	GET /get/1719730930 HTTP/1.1 
377866	2023-09-21 14:21:19.330341	10.15.11.231	10.15.1.82	TCP	228	80 → 47598 [PSH, ACK] Seq=335 Ack=1902 Win=30720 Len=162 TSval=2888153664 TSecr=2656697046 [TCP segment of a reassembled PDU]
377867	2023-09-21 14:21:19.330410	10.15.11.231	10.15.1.82	HTTP/JSON	71	HTTP/1.1 200 OK , JavaScript Object Notation (application/json)
377974	2023-09-21 14:21:19.340463	10.15.1.82	10.15.11.231	TCP	66	47598 → 80 [ACK] Seq=1902 Ack=502 Win=30208 Len=0 TSval=2656697058 TSecr=2888153664
378084	2023-09-21 14:21:19.348936	10.15.1.82	10.15.11.231	TCP	66	47598 → 80 [FIN, ACK] Seq=1902 Ack=502 Win=30208 Len=0 TSval=2656697066 TSecr=2888153664
378090	2023-09-21 14:21:19.349165	10.15.11.231	10.15.1.82	TCP	66	80 → 47598 [FIN, ACK] Seq=502 Ack=1903 Win=30720 Len=0 TSval=2888153683 TSecr=2656697066
378091	2023-09-21 14:21:19.349170	10.15.1.82	10.15.11.231	TCP	66	47598 → 80 [ACK] Seq=1903 Ack=503 Win=30208 Len=0 TSval=2656697066 TSecr=2888153683

It will be easier to investigate this issue if you can reproduce it with plain HTTP. I assume that in your last post you are also observing some Connection has been closed BEFORE send operation logs just before the gateway closes the connection?

So, yesterday, I could not reproduce the problem. Are you able to reproduce it on localhost? If so, can you provide a minimal reproducer project? Just a minimal Spring Cloud Gateway with your yml configuration and, if possible, the Java routes (if you are using any), and indicate which HTTP requests can be sent to the gateway (is it a POST? is it using transfer-encoding: chunked? etc.).

Otherwise, if you can't provide a reproducer project:

  • Can you reproduce the problem with DEBUG logs enabled and provide them?
  • Can you provide your application.yml file (in order to check the HttpClient configuration, such as max-idle-time, max-life-time, the configured routes, etc.)?
  • Are you using any custom Java-based routes? If so, does disabling them resolve the problem (keeping only the simplest routes configured in the yml file)?
  • If you are using the HttpClient max-idle-time or max-life-time settings, does disabling them fix the problem?

Thanks.

Hi @yurybubnov ,

I'm closing this one for the moment, but of course this issue can be reopened if you have time to answer the previous questions (I tried to reproduce the issue two weeks ago, but could not).

Thank you.