Azure / azure-signalr

Azure SignalR Service SDK for .NET

Home Page: https://aka.ms/signalr-service

Ack-able message Microsoft.Azure.SignalR.Protocol.JoinGroupWithAckMessage timed out

AhsanKhan-Marn opened this issue · comments

I faced connection drops for the SignalR client and I received these errors:

"Ack-able message Microsoft.Azure.SignalR.Protocol.JoinGroupWithAckMessage(ackId: 196215) timed out."
"Unable to write message to endpoint: {{URL of Service}}"

This can happen when the server connection drops. A connection drop can be caused by a service restart, service maintenance, or intermittent network issues. The troubleshooting guide explains this in more detail: https://learn.microsoft.com/en-us/azure/azure-signalr/signalr-howto-troubleshoot-guide#server_connection_drop.

In general, we recommend adding retry logic that handles TimeoutException when adding a connection to a group.
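For readers wondering what such retry logic might look like, here is a minimal, language-neutral sketch written in Python. It is not the SDK's API: `retry_on_timeout`, `max_attempts`, and `base_delay` are hypothetical names, and the attempt count and delays are illustrative only. In a .NET hub the awaited operation would be the group-join call and the caught exception would be TimeoutException.

```python
import asyncio
import random


async def retry_on_timeout(operation, max_attempts=3, base_delay=0.5):
    """Await operation(), retrying on TimeoutError with exponential backoff.

    operation is a zero-argument coroutine function (hypothetical stand-in
    for the "add connection to group" call).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except TimeoutError:
            if attempt == max_attempts:
                # Out of attempts: surface the timeout to the caller.
                raise
            # Wait base_delay, 2 * base_delay, ... plus a little jitter so
            # that many callers do not retry at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

Backing off between attempts matters here because a burst of immediate retries adds load during exactly the periods when the server connection is already struggling.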

Hello @vicancy
We already have retry logic. The errors had not occurred for a long time, but now we are receiving them again.

Today it is happening again and again, as shown in the picture below.

image

Could you email me at lianwei(at)microsoft.com with your resource ID and the timestamps when these issues happen?

Sent the email with required details

@vicancy please see the email from ahsan@marn.com

We see this issue as soon as we upgrade to any version above 1.19.2.
Going back to that version magically "resolves" the issue.
A similar issue was reported in the past too: #745

> Hello @vicancy We already have retry logic. The errors had not occurred for a long time, but now we are receiving them again.
>
> Today it is happening again and again, as shown in the picture below.
>
> image

This issue is caused by thread starvation on the app server: the app server thread count peaks during those periods, and the server connections are then closed with "PingTimeout".

> We see this issue as soon as we upgrade to any version above 1.19.2. Going back to that version magically "resolves" the issue. A similar issue was reported in the past too: #745

This one should be related to the fix in #1779 (@KKhurin): all group messages were changed to be sticky to one server connection.

Hi @SplitThePotCyrus, could you email me at lianwei(at)microsoft.com with your resource name and the times when you upgraded and saw the timeout errors, so that I can check further on the service side?

> > Hello @vicancy We already have retry logic. The errors had not occurred for a long time, but now we are receiving them again.
> > Today it is happening again and again, as shown in the picture below.
> > image
>
> This issue is caused by thread starvation on the app server: the app server thread count peaks during those periods, and the server connections are then closed with "PingTimeout".

Yes, thanks for your support. I am following the suggestions you shared in your email. I have also removed the logic from the OnDisconnectedAsync method that might have caused thread starvation.

I am also getting this issue, and wondered whether any guidance could be shared from the suggestions in the email conversations you've had?

Many different scenarios can lead to this exception:

  1. The server connection dropped around that period
  2. A peak in join/leave group requests
  3. Traffic too heavy for the server connection to serve
  4. Traffic too heavy for the service to serve
  5. High CPU or thread starvation on the app server side: https://learn.microsoft.com/en-us/azure/azure-signalr/signalr-howto-troubleshoot-guide#thread-pool-starvation

The metrics and logs might give hints on the root cause.