Race condition in HTTP Client
yurybubnov opened this issue
Under load, HTTP client reuses closed connections from the pool.
I have been investigating reactor.netty.channel.AbortedException: Connection has been closed BEFORE send operation errors for a while, and captured a tcpdump and logs for the same connection.
The connection was successfully opened, used, and closed by the server, and the close was acknowledged by the client.
71896 2023-09-20 18:39:11.072236 10.15.12.148 10.15.31.167 TCP 74 54978 → 443 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM TSval=4020257053 TSecr=0 WS=128
71917 2023-09-20 18:39:11.073177 10.15.31.167 10.15.12.148 TCP 74 443 → 54978 [SYN, ACK] Seq=0 Ack=1 Win=26847 Len=0 MSS=8961 SACK_PERM TSval=3652594669 TSecr=4020257053 WS=256
71918 2023-09-20 18:39:11.073187 10.15.12.148 10.15.31.167 TCP 66 54978 → 443 [ACK] Seq=1 Ack=1 Win=27008 Len=0 TSval=4020257054 TSecr=3652594669
71933 2023-09-20 18:39:11.073747 10.15.12.148 10.15.31.167 TLSv1.2 583 Client Hello
71957 2023-09-20 18:39:11.074671 10.15.31.167 10.15.12.148 TCP 66 443 → 54978 [ACK] Seq=1 Ack=518 Win=28160 Len=0 TSval=3652594671 TSecr=4020257054
71984 2023-09-20 18:39:11.075574 10.15.31.167 10.15.12.148 TLSv1.2 5509 Server Hello, Certificate, Server Key Exchange, Server Hello Done
71986 2023-09-20 18:39:11.075581 10.15.12.148 10.15.31.167 TCP 66 54978 → 443 [ACK] Seq=518 Ack=5444 Win=44800 Len=0 TSval=4020257056 TSecr=3652594672
72041 2023-09-20 18:39:11.079305 10.15.12.148 10.15.31.167 TLSv1.2 192 Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
72059 2023-09-20 18:39:11.080312 10.15.31.167 10.15.12.148 TCP 66 443 → 54978 [ACK] Seq=5444 Ack=644 Win=28160 Len=0 TSval=3652594677 TSecr=4020257060
72060 2023-09-20 18:39:11.080361 10.15.31.167 10.15.12.148 TLSv1.2 237 New Session Ticket, Change Cipher Spec, Encrypted Handshake Message
72061 2023-09-20 18:39:11.080438 10.15.12.148 10.15.31.167 TLSv1.2 727 Application Data
72100 2023-09-20 18:39:11.083014 10.15.31.167 10.15.12.148 TLSv1.2 257 Application Data
72102 2023-09-20 18:39:11.083212 10.15.31.167 10.15.12.148 TLSv1.2 100 Application Data
72113 2023-09-20 18:39:11.083969 10.15.12.148 10.15.31.167 TCP 66 54978 → 443 [ACK] Seq=1305 Ack=5840 Win=66560 Len=0 TSval=4020257065 TSecr=3652594679
........
bunch of data flowing here
.........
513450 2023-09-20 18:40:11.687199 10.15.12.148 10.15.31.167 TCP 66 54978 → 443 [ACK] Seq=2306299 Ack=854828 Win=108544 Len=0 TSval=4020317669 TSecr=3652655271
513551 2023-09-20 18:40:11.701571 10.15.12.148 10.15.31.167 TLSv1.2 727 Application Data
513573 2023-09-20 18:40:11.703679 10.15.31.167 10.15.12.148 TLSv1.2 257 Application Data
513576 2023-09-20 18:40:11.703710 10.15.31.167 10.15.12.148 TLSv1.2 100 Application Data
513706 2023-09-20 18:40:11.712073 10.15.12.148 10.15.31.167 TCP 66 54978 → 443 [ACK] Seq=2306960 Ack=855053 Win=108544 Len=0 TSval=4020317693 TSecr=3652655300
513832 2023-09-20 18:40:11.720647 10.15.12.148 10.15.31.167 TLSv1.2 97 Encrypted Alert
513833 2023-09-20 18:40:11.720670 10.15.12.148 10.15.31.167 TCP 66 54978 → 443 [FIN, ACK] Seq=2306991 Ack=855053 Win=108544 Len=0 TSval=4020317702 TSecr=3652655300
513842 2023-09-20 18:40:11.721486 10.15.31.167 10.15.12.148 TCP 66 443 → 54978 [FIN, ACK] Seq=855053 Ack=2306992 Win=108800 Len=0 TSval=3652655317 TSecr=4020317702
513843 2023-09-20 18:40:11.721496 10.15.12.148 10.15.31.167 TCP 66 54978 → 443 [ACK] Seq=2306992 Ack=855054 Win=108544 Len=0 TSval=4020317703 TSecr=3652655317
Then, Netty tried to use the connection and, obviously, failed.
timestamp: "09/20/2023 6:40:11.722 PM -0700",
logger: "reactor.netty.http.client.HttpClientConnect",
message: "[48ed1f67-3507, L:/10.15.12.148:54978 ! R:access-manager.private.site.com/10.15.31.167:443] The connection observed an error, the request cannot be retried as the headers/body were sent",
context: "default",
exception: "reactor.netty.channel.AbortedException: Connection has been closed BEFORE send operation
at reactor.netty.channel.AbortedException.beforeSend(AbortedException.java:59)
at reactor.netty.http.client.HttpClientOperations.onInboundClose(HttpClientOperations.java:295)
at reactor.netty.channel.ChannelOperationsHandler.channelInactive(ChannelOperationsHandler.java:73)
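To line the log entry up with the capture, the two timestamps can be subtracted directly. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes the capture shows local time in the same UTC offset as the log and that both clocks agree. The log has millisecond resolution, so the result is accurate to about a millisecond at best.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class CaptureLogDelta {
    // Capture timestamps look like "2023-09-20 18:40:11.721496".
    static final DateTimeFormatter CAPTURE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS", Locale.ENGLISH);
    // Log timestamps look like "09/20/2023 6:40:11.722 PM -0700".
    static final DateTimeFormatter LOG =
            DateTimeFormatter.ofPattern("MM/dd/yyyy h:mm:ss.SSS a Z", Locale.ENGLISH);

    /** Microseconds from the capture timestamp to the log timestamp (negative if the log entry is earlier). */
    static long microsBetween(String captureTs, String logTs) {
        LocalDateTime capture = LocalDateTime.parse(captureTs, CAPTURE);
        // Assumption: the capture is printed in the same UTC offset as the log entry.
        LocalDateTime log = OffsetDateTime.parse(logTs, LOG).toLocalDateTime();
        return Duration.between(capture, log).toNanos() / 1_000;
    }

    public static void main(String[] args) {
        // Final ACK of the close handshake on port 54978 vs. the logged error.
        long micros = microsBetween("2023-09-20 18:40:11.721496",
                                    "09/20/2023 6:40:11.722 PM -0700");
        System.out.println(micros + " microseconds"); // 504 microseconds
    }
}
```

Under those assumptions, the error was logged roughly half a millisecond after the last ACK of the close handshake.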
Expected Behavior
Closed connections should not be allocated for requests.
Actual Behavior
The client is using a closed connection.
Steps to Reproduce
We experience this under load and only for HTTPS (not HTTP) connections, for some reason.
Your Environment
Java 17
Spring Cloud Gateway 4.0.7
Reactor Netty 1.1.10
Ubuntu container on AWS EC2 host
Another example.
Here is the full tcpdump:
513600 2023-09-20 18:40:11.704839 10.15.12.148 10.15.40.134 TCP 74 33460 → 443 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM TSval=233519926 TSecr=0 WS=128
513609 2023-09-20 18:40:11.705428 10.15.40.134 10.15.12.148 TCP 74 443 → 33460 [SYN, ACK] Seq=0 Ack=1 Win=26847 Len=0 MSS=8961 SACK_PERM TSval=2639073627 TSecr=233519926 WS=256
513610 2023-09-20 18:40:11.705441 10.15.12.148 10.15.40.134 TCP 66 33460 → 443 [ACK] Seq=1 Ack=1 Win=27008 Len=0 TSval=233519927 TSecr=2639073627
513620 2023-09-20 18:40:11.706252 10.15.12.148 10.15.40.134 TLSv1.2 583 Client Hello
513632 2023-09-20 18:40:11.706824 10.15.40.134 10.15.12.148 TCP 66 443 → 33460 [ACK] Seq=1 Ack=518 Win=28160 Len=0 TSval=2639073629 TSecr=233519928
513651 2023-09-20 18:40:11.708261 10.15.40.134 10.15.12.148 TLSv1.2 5509 Server Hello, Certificate, Server Key Exchange, Server Hello Done
513652 2023-09-20 18:40:11.708274 10.15.12.148 10.15.40.134 TCP 66 33460 → 443 [ACK] Seq=518 Ack=5444 Win=44800 Len=0 TSval=233519930 TSecr=2639073630
513773 2023-09-20 18:40:11.716205 10.15.12.148 10.15.40.134 TLSv1.2 192 Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
513785 2023-09-20 18:40:11.716866 10.15.40.134 10.15.12.148 TCP 66 443 → 33460 [ACK] Seq=5444 Ack=644 Win=28160 Len=0 TSval=2639073639 TSecr=233519938
513786 2023-09-20 18:40:11.716893 10.15.40.134 10.15.12.148 TLSv1.2 237 New Session Ticket, Change Cipher Spec, Encrypted Handshake Message
513845 2023-09-20 18:40:11.722007 10.15.12.148 10.15.40.134 TLSv1.2 729 Application Data
513872 2023-09-20 18:40:11.723679 10.15.40.134 10.15.12.148 TLSv1.2 257 Application Data
513875 2023-09-20 18:40:11.723736 10.15.40.134 10.15.12.148 TLSv1.2 100 Application Data
513940 2023-09-20 18:40:11.731760 10.15.12.148 10.15.40.134 TCP 66 33460 → 443 [ACK] Seq=1307 Ack=5840 Win=66560 Len=0 TSval=233519953 TSecr=2639073646
514028 2023-09-20 18:40:11.739822 10.15.12.148 10.15.40.134 TLSv1.2 97 Encrypted Alert
514029 2023-09-20 18:40:11.739838 10.15.12.148 10.15.40.134 TCP 66 33460 → 443 [FIN, ACK] Seq=1338 Ack=5840 Win=66560 Len=0 TSval=233519961 TSecr=2639073646
514032 2023-09-20 18:40:11.740432 10.15.40.134 10.15.12.148 TCP 66 443 → 33460 [FIN, ACK] Seq=5840 Ack=1339 Win=29440 Len=0 TSval=2639073662 TSecr=233519961
514033 2023-09-20 18:40:11.740448 10.15.12.148 10.15.40.134 TCP 66 33460 → 443 [ACK] Seq=1339 Ack=5841 Win=66560 Len=0 TSval=233519962 TSecr=2639073662
Corresponding error:
timestamp: 09/20/2023 6:40:11.739 PM -0700
message: "[84676159-2, L:/10.15.12.148:33460 ! R:access-manager.private.site.com/10.15.40.134:443] The connection observed an error, the request cannot be retried as the headers/body were sent",
context: "default",
exception :"reactor.netty.channel.AbortedException: Connection has been closed BEFORE send operation
at reactor.netty.channel.AbortedException.beforeSend(AbortedException.java:59)
For some reason, the Netty client sends an Encrypted Alert:
Transport Layer Security
TLSv1.2 Record Layer: Encrypted Alert
Content Type: Alert (21)
Version: TLS 1.2 (0x0303)
Length: 26
Alert Message: Encrypted Alert
Hi @yurybubnov,
Connection has been closed BEFORE send operation means that the connection could be obtained from the pool and was alive, but before the request was sent, the remote peer closed the connection (or the connection was closed for another reason).
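The sequence described here (liveness checked when the connection is acquired, the close arriving before the write) is a check-then-act window. The toy model below replays that interleaving deterministically in a single thread. `Conn` and `Pool` are hypothetical stand-ins, not Reactor Netty types, and this is not a description of how Reactor Netty's pool is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PoolRaceSketch {
    /** Hypothetical stand-in for a pooled connection. */
    static final class Conn {
        private boolean open = true;
        boolean isOpen() { return open; }
        void close() { open = false; }
        void send(String request) {
            if (!open) {
                throw new IllegalStateException("Connection has been closed BEFORE send operation");
            }
            // A real client would write the request to the socket here.
        }
    }

    /** Hypothetical pool that validates liveness only at acquire time. */
    static final class Pool {
        private final Deque<Conn> idle = new ArrayDeque<>();
        void release(Conn c) { idle.push(c); }
        Conn acquire() {
            Conn c;
            while ((c = idle.poll()) != null) {
                if (c.isOpen()) {
                    return c; // the liveness check passes at this instant only
                }
            }
            return new Conn(); // nothing usable in the pool: open a new one
        }
    }

    /** Replays the interleaving: acquire (alive), close, then send. */
    static String run() {
        Pool pool = new Pool();
        pool.release(new Conn());
        Conn c = pool.acquire(); // connection is alive when handed out
        c.close();               // the close lands in the window before the write
        try {
            c.send("GET /");
            return "sent";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the sketch is only that a check at acquire time cannot rule out a close that happens afterwards; it says nothing about which side initiated the close in the captures.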
Now, from the tcpdump, can you confirm that the gateway is running on 10.15.12.148 and the destination server is 10.15.40.134?
If so, the Encrypted Alert and the first FIN seem to be sent by the gateway, not by the server, or am I missing something?
513620 2023-09-20 18:40:11.706252 10.15.12.148 10.15.40.134 TLSv1.2 583 Client Hello
...
513651 2023-09-20 18:40:11.708261 10.15.40.134 10.15.12.148 TLSv1.2 5509 Server Hello, Certificate, Server Key Exchange, Server Hello Done
...
514028 2023-09-20 18:40:11.739822 10.15.12.148 10.15.40.134 TLSv1.2 97 Encrypted Alert
514029 2023-09-20 18:40:11.739838 10.15.12.148 10.15.40.134 TCP 66 33460 → 443 [FIN, ACK] Seq=1338 Ack=5840 Win=66560 Len=0 TSval=233519961 TSecr=2639073646
514032 2023-09-20 18:40:11.740432 10.15.40.134 10.15.12.148 TCP 66 443 → 33460 [FIN, ACK] Seq=5840 Ack=1339 Win=29440 Len=0 TSval=2639073662 TSecr=233519961
Can you then check whether you see any exceptions in the gateway just before the Encrypted Alert is sent?
Let me know.
Thanks.
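The "who sent the first FIN" question can also be answered mechanically from the text export. The following is a minimal sketch that assumes the column layout shown in this thread (frame number, date, time, source, destination, protocol, then the info text); the class name and helper are hypothetical.

```java
import java.util.List;
import java.util.Optional;

final class FirstFinFinder {
    // Three lines copied from the capture above ("\u2192" is the arrow character).
    static final List<String> SAMPLE = List.of(
        "514028 2023-09-20 18:40:11.739822 10.15.12.148 10.15.40.134 TLSv1.2 97 Encrypted Alert",
        "514029 2023-09-20 18:40:11.739838 10.15.12.148 10.15.40.134 TCP 66 33460 \u2192 443 [FIN, ACK] Seq=1338 Ack=5840 Win=66560 Len=0 TSval=233519961 TSecr=2639073646",
        "514032 2023-09-20 18:40:11.740432 10.15.40.134 10.15.12.148 TCP 66 443 \u2192 33460 [FIN, ACK] Seq=5840 Ack=1339 Win=29440 Len=0 TSval=2639073662 TSecr=233519961");

    /** Returns the source address of the first segment whose flag list contains FIN. */
    static Optional<String> firstFinSender(List<String> lines) {
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 6) {
                continue; // filler such as "..." or free text
            }
            int open = line.indexOf('[');
            int close = line.indexOf(']', open + 1);
            if (open < 0 || close < 0) {
                continue; // no TCP flag list on this line (for example, a TLS record)
            }
            String flags = line.substring(open + 1, close);
            if (List.of(flags.split(",\\s*")).contains("FIN")) {
                return Optional.of(fields[3]); // fourth column is the source address
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstFinSender(SAMPLE).orElse("no FIN found")); // 10.15.12.148
    }
}
```

For the sample lines, the first FIN comes from 10.15.12.148, which matches the reading above that the gateway side started the close.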
Hello @pderop
You're right, the gateway runs on 10.15.12.148.
There are no errors or warnings other than Connection has been closed BEFORE send operation.
I was able to reproduce it with non-SSL connections. In this example, the gateway runs on 10.15.1.82. As you can see, it opens the connection, sends three requests, and closes it.
Here is an example tcpdump:
377082 2023-09-21 14:21:19.293293 10.15.1.82 10.15.11.231 TCP 74 47598 → 80 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM TSval=2656697011 TSecr=0 WS=128
377111 2023-09-21 14:21:19.293638 10.15.11.231 10.15.1.82 TCP 74 80 → 47598 [SYN, ACK] Seq=0 Ack=1 Win=26847 Len=0 MSS=8961 SACK_PERM TSval=2888153627 TSecr=2656697011 WS=256
377113 2023-09-21 14:21:19.293647 10.15.1.82 10.15.11.231 TCP 66 47598 → 80 [ACK] Seq=1 Ack=1 Win=27008 Len=0 TSval=2656697011 TSecr=2888153627
377294 2023-09-21 14:21:19.299810 10.15.1.82 10.15.11.231 HTTP 700 GET /get/1959559705 HTTP/1.1
377303 2023-09-21 14:21:19.299996 10.15.11.231 10.15.1.82 TCP 66 80 → 47598 [ACK] Seq=1 Ack=635 Win=28160 Len=0 TSval=2888153633 TSecr=2656697017
377349 2023-09-21 14:21:19.301385 10.15.11.231 10.15.1.82 TCP 228 80 → 47598 [PSH, ACK] Seq=1 Ack=635 Win=28160 Len=162 TSval=2888153635 TSecr=2656697017 [TCP segment of a reassembled PDU]
377350 2023-09-21 14:21:19.301395 10.15.1.82 10.15.11.231 TCP 66 47598 → 80 [ACK] Seq=635 Ack=163 Win=28032 Len=0 TSval=2656697019 TSecr=2888153635
377354 2023-09-21 14:21:19.301447 10.15.11.231 10.15.1.82 HTTP/JSON 71 HTTP/1.1 200 OK , JavaScript Object Notation (application/json)
377355 2023-09-21 14:21:19.301453 10.15.1.82 10.15.11.231 TCP 66 47598 → 80 [ACK] Seq=635 Ack=168 Win=28032 Len=0 TSval=2656697019 TSecr=2888153635
377648 2023-09-21 14:21:19.321272 10.15.1.82 10.15.11.231 HTTP 699 GET /get/1868857449 HTTP/1.1
377688 2023-09-21 14:21:19.323131 10.15.11.231 10.15.1.82 TCP 228 80 → 47598 [PSH, ACK] Seq=168 Ack=1268 Win=29440 Len=162 TSval=2888153657 TSecr=2656697039 [TCP segment of a reassembled PDU]
377691 2023-09-21 14:21:19.323164 10.15.11.231 10.15.1.82 HTTP/JSON 71 HTTP/1.1 200 OK , JavaScript Object Notation (application/json)
377729 2023-09-21 14:21:19.325043 10.15.1.82 10.15.11.231 TCP 66 47598 → 80 [ACK] Seq=1268 Ack=335 Win=29056 Len=0 TSval=2656697042 TSecr=2888153657
377834 2023-09-21 14:21:19.328871 10.15.1.82 10.15.11.231 HTTP 700 GET /get/1719730930 HTTP/1.1
377866 2023-09-21 14:21:19.330341 10.15.11.231 10.15.1.82 TCP 228 80 → 47598 [PSH, ACK] Seq=335 Ack=1902 Win=30720 Len=162 TSval=2888153664 TSecr=2656697046 [TCP segment of a reassembled PDU]
377867 2023-09-21 14:21:19.330410 10.15.11.231 10.15.1.82 HTTP/JSON 71 HTTP/1.1 200 OK , JavaScript Object Notation (application/json)
377974 2023-09-21 14:21:19.340463 10.15.1.82 10.15.11.231 TCP 66 47598 → 80 [ACK] Seq=1902 Ack=502 Win=30208 Len=0 TSval=2656697058 TSecr=2888153664
378084 2023-09-21 14:21:19.348936 10.15.1.82 10.15.11.231 TCP 66 47598 → 80 [FIN, ACK] Seq=1902 Ack=502 Win=30208 Len=0 TSval=2656697066 TSecr=2888153664
378090 2023-09-21 14:21:19.349165 10.15.11.231 10.15.1.82 TCP 66 80 → 47598 [FIN, ACK] Seq=502 Ack=1903 Win=30720 Len=0 TSval=2888153683 TSecr=2656697066
378091 2023-09-21 14:21:19.349170 10.15.1.82 10.15.11.231 TCP 66 47598 → 80 [ACK] Seq=1903 Ack=503 Win=30208 Len=0 TSval=2656697066 TSecr=2888153683
It will be easier to investigate this issue if you can reproduce it with plain HTTP. I assume that in your last post you are also observing some Connection has been closed BEFORE send operation logs just before the gateway closes the connection?
Yesterday, I could not reproduce the problem. Are you able to reproduce it on localhost? If so, can you provide a minimal reproducer project: just a minimal Spring Cloud Gateway with your yml configuration and, if possible, the Java routes (if you are using any)? Please also indicate which HTTP requests can be sent to the gateway (is it a POST? is it using transfer-encoding: chunked? etc.).
Otherwise, if you can't provide a reproducer project:
- Can you reproduce the problem with DEBUG logs and provide them?
- Can you provide your application.yml file (in order to check the httpclient configuration such as max-idle-time and max-life-time, the configured routes, etc.)?
- Are you using any custom Java-based routes? If so, does disabling them resolve the problem (keeping only the simplest routes configured in the yml file)?
- If you are using the httpclient max-idle-time or max-life-time settings, does disabling them fix the problem?
Thanks.
Hi @yurybubnov,
I'm closing this one for the moment, but of course this issue can be reopened if you have time to check the previous questions (I tried to reproduce the issue two weeks ago, but I could not).
thank you.