[QUESTION] Expected behavior of parallel Strategy for MultiSubnetFailover option

ml-rex opened this issue a year ago

Question

I have an application that connects to a SQL Server multi-subnet cluster with two subnets (primary and DR).
In this setup, the DR node is offline while the primary is the active one. We have also set up the Availability Group Listener with a DNS record that round-robins the two subnet IP addresses. The application uses TypeORM with the mssql driver, which uses tedious.

https://learn.microsoft.com/en-us/sql/sql-server/failover-clusters/windows/sql-server-multi-subnet-clustering-sql-server?view=sql-server-ver16

As suggested in the link, we added the multiSubnetFailover: true option to the connection config, expecting tedious to create connections only to the database nodes in the active primary subnet. However, we sometimes receive the error message: "ConnectionError: Connection lost - read ECONNRESET".

After investigating, we found a pattern: tedious was connecting to the offline DR IP whenever this error happened. This is not what I expected, since a connection to the offline IP should fail the pool validation check and never be added to the connection pool.

Stepping through the tedious source code with a debugger, I can confirm that ParallelConnectionStrategy is used when the multiSubnetFailover option is provided. Apparently the TCP connection to the offline IP is established successfully, but the connection emits an error later. I added some console logging to show what happened:

ParallelConnectionStrategy Addresses: [{"address":"DR_IP", "family": 4}, {"address": "PRIMARY_IP", "family": 4}]
Connecting {"address": "DR_IP", "family": 4}
Connecting {"address": "PRIMARY_IP", "family": 4}
onConnect: "DR_IP"
Sending Pre Login: "DR_IP"
onError: "DR_IP"
socketError this.state: {"name": "SentPrelogin", "events": {}} socketError error: {"errno":-104, "code": "ECONNRESET", "syscall": "read" }
2023-06-10 13:23:53 TypeormDatabaseLogger info: warn
"MSSQL pool raised an error. ConnectionError: Connection lost - read ECONNRESET"
2023-06-10 13:23:53 TypeormDatabaseLogger error: QueryFailedError: ConnectionError: Connection lost - read ECONNRESET
"trace": {
  "Query": "select 1"
}

My questions are:

Am I missing some config options?
When multiSubnetFailover is set to true, does tedious check and reject an offline subnet IP?
If not, do all applications using tedious need to implement this check themselves?

Versions:
Typeorm: 0.3.12
Mssql: 7.3.0
Tedious: ^11.4.0

Config

{
    name: 'MY_DB_CONN',
    type: 'mssql',
    host: config.hostUrl,
    port: config.port,
    username: config.username,
    password: config.password,
    database: config.database,
    keepConnectionAlive: false,
    requestTimeout: 3 * 15000,
    pool: {
      min: 1,
      max: 10,
    },
    options: {
      encrypt: true,
      trustServerCertificate: true,
      multiSubnetFailover: true,
      useUTC: true,
    },
}

Malcolm Stewart · Answer 1 · Wed Jul 12 2023 02:37:47 GMT+0800 (China Standard Time)

I was using a different driver, but the symptom sounds similar.

I have seen this issue when a smart device, such as F5 or similar, answers the SYN packet going to the secondary subnet. In this case, the driver will get fooled as to which subnet connection has the active node and will attempt to connect to the inactive node. However, once the PreLogin packet is emitted, the device tries to contact the back-end database and fails.

The case I had was intermittent and the F5 device was configured to detect SYN attacks and once the number of SYN packets in the inactive subnet reached a certain threshold, it would start answering them, and then, later, it would stop answering them for a while.

I was able to replicate it with TELNET and using the inactive IP address. For a few minutes, it would die a normal death and then for another few minutes, it would open up as if it was connected to the back-end. You can see the response packets in a network trace.
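
For anyone who wants to repeat the TELNET check from Node, here is a minimal sketch. The `probeTcp` helper is hypothetical (it is not part of tedious or mssql); it only reports whether a plain TCP connect completes, is refused or reset, or times out. An "open" result on the inactive IP would suggest that something other than SQL Server is answering the handshake.

```javascript
const net = require('node:net');

// Hypothetical helper, not part of tedious: attempt a plain TCP connect
// and report how it ended. It never sends any TDS/PreLogin data.
function probeTcp(host, port, timeoutMs = 3000) {
  return new Promise((resolve) => {
    const startedAt = Date.now();
    const socket = net.connect({ host, port });
    let settled = false;

    const finish = (status, error) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      socket.destroy();
      resolve({
        host,
        port,
        status, // 'open' | 'error' | 'timeout'
        code: error ? error.code : undefined,
        elapsedMs: Date.now() - startedAt,
      });
    };

    // No answer at all within timeoutMs (the "normal death" case).
    const timer = setTimeout(() => finish('timeout'), timeoutMs);
    // TCP handshake completed: something accepted the SYN.
    socket.once('connect', () => finish('open'));
    // Refused, reset, unreachable, and so on.
    socket.once('error', (err) => finish('error', err));
  });
}

// Example (placeholders, assuming the default SQL Server port):
// probeTcp('DR_IP', 1433).then(console.log);
```

Running it repeatedly against the inactive IP over several minutes should show whether the result flips between "timeout" and "open", as described above.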

Michael Sun · Answer 2 · Wed Jul 12 2023 03:21:49 GMT+0800 (China Standard Time)

Hi @ml-rex, thanks for raising this and for the detailed explanation. I will do my best to answer your questions:
Am I missing some config options?
I do not think there is any additional config related to this.

When multiSubnetFailover is set to true, does tedious check and reject an offline subnet IP?
Currently, when multiSubnetFailover is set to true, tedious forms connections in parallel to all addresses returned by dns.lookup. From the behavior you describe, it seems dns.lookup returns all the addresses associated with the host regardless of their status. I tried but failed to find anything concrete that explains whether IP status matters for the addresses this function returns. On the tedious side, the logic tries to connect to all returned addresses; the connection to the offline IP failed, hence the socket error.
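
To see what the resolver hands back for the listener name, a small sketch along these lines may help. It uses Node's `dns.promises.lookup` with `all: true`, which goes through the operating system resolver and does not probe whether an address is reachable. The function name and the injectable `lookup` parameter are assumptions made for illustration; this is not how tedious exposes the lookup.

```javascript
const dns = require('node:dns');

// List every address the resolver returns for a hostname.
// `lookup` is injectable only so the sketch can be tried without real DNS.
async function listListenerAddresses(hostname, lookup = dns.promises.lookup) {
  // With { all: true } the result is an array of { address, family } records.
  const records = await lookup(hostname, { all: true });
  return records.map(({ address, family }) => ({ address, family }));
}

// Example (hypothetical listener name):
// listListenerAddresses('my-ag-listener.example.com').then(console.log);
```

If both the primary and the DR IP show up here while the DR node is offline, that would be consistent with the lookup being status-agnostic.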

If not, does all applications using tedious need to implement this check themselves?
Unfortunately, the current tedious logic can only fail a connection after trying to connect, rejecting it if there is a socket error. We can definitely investigate whether it is possible to filter out IP addresses by their status and simplify this process.
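
Until something like that exists, one application-level workaround is to retry an operation when the failure looks like a connection reset. The sketch below is only an illustration: `withRetry` and `looksLikeConnectionReset` are hypothetical names, and matching on the text "ECONNRESET" is an assumption based on the error message quoted in this issue, not a documented error contract of tedious or mssql.

```javascript
// Hypothetical predicate: treat any error mentioning ECONNRESET as transient.
function looksLikeConnectionReset(err) {
  return Boolean(err) &&
    (err.code === 'ECONNRESET' || /ECONNRESET/.test(String(err.message)));
}

// Run `operation`, retrying up to `retries` extra times on transient errors.
async function withRetry(operation, options = {}) {
  const {
    retries = 2,
    delayMs = 200,
    isTransient = looksLikeConnectionReset,
  } = options;

  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (err) {
      // Give up on the last attempt or on errors that are not transient.
      if (attempt >= retries || !isTransient(err)) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Example (assuming `dataSource` is an initialized TypeORM DataSource):
// await withRetry(() => dataSource.query('select 1'));
```

This is only reasonable for operations that are safe to repeat, such as the "select 1" validation query, and it hides the symptom rather than fixing the cause.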

Hi @arthurschreiber, am I correct that dns.lookup returns all addresses regardless of their online/offline status? Are you aware of any way to look up the addresses but filter out the offline IPs?

ml-rex · Answer 3 · Wed Jul 12 2023 14:54:10 GMT+0800 (China Standard Time)

Thanks for sharing.
From what you describe, it seems we have to handle the inactive-node connection check on top of the tedious package.

Can you also share which driver you were using when you experienced the issue?

ml-rex · Answer 4 · Wed Jul 12 2023 15:08:55 GMT+0800 (China Standard Time)

Thanks, Michael, for the answer.
It sounds like what I observed is the expected behavior of tedious by design.

To add to this: in my case, I do see the node-mssql driver performing another connection validation before pushing a connection into the pool.
https://github.com/tediousjs/node-mssql/blob/7248e58ff223b2369cb1570005d54e9196c904bf/lib/base/connection-pool.js#L379
https://github.com/tediousjs/node-mssql/blob/7248e58ff223b2369cb1570005d54e9196c904bf/lib/tedious/connection-pool.js#L104
However, I'm still unsure why an unhealthy connection was picked to handle a request.

Malcolm Stewart · Answer 5 · Wed Jul 12 2023 22:22:32 GMT+0800 (China Standard Time)

Hi @ml-rex, the driver does not matter. In my case it was the .NET SqlClient driver.

The way that multi-subnet works is that the DNS request will return multiple IP addresses in a random order. The primary server maps the IP address for its subnet to the MAC address of its NIC. The secondary releases its IP address, so it is not connected to anything. If the driver connected to the secondary IP address first, e.g. when not using MultiSubnetFailover, it would normally take 21 seconds to get an error from the network.

MultiSubnetFailover (MSF) overcomes this by connecting to both/all IP addresses in parallel, assuming that the primary will respond in a few ms and that the secondary won't respond but will error out later. Once a response is made, it cancels the other connection attempt and uses the first connection. This works really well.

But in the case I experienced, a network device thwarted the connection assumptions. It's generally better to identify and remove the device doing this than to try to predict which IP address should be connected to. Your code could potentially be subject to the same "spoofing" from the device.
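
The "connect to all, keep the first responder" idea can be sketched in a few lines of Node. This is a simplified illustration under my own assumptions, not tedious's ParallelConnectionStrategy or SqlClient's implementation; `connectToFirstResponder` and the `{ host, port }` target shape are hypothetical.

```javascript
const net = require('node:net');

// Start a TCP connect to every target at once; resolve with the first
// socket whose handshake completes and destroy all the others.
function connectToFirstResponder(targets, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    if (targets.length === 0) {
      reject(new Error('No addresses to try'));
      return;
    }

    const sockets = [];
    let pending = targets.length;
    let done = false;

    const finish = (winner) => {
      done = true;
      clearTimeout(timer);
      for (const s of sockets) {
        if (s !== winner) s.destroy(); // cancel the losing attempts
      }
    };

    const timer = setTimeout(() => {
      if (done) return;
      finish(null);
      reject(new Error('Timed out waiting for any address to respond'));
    }, timeoutMs);

    for (const { host, port } of targets) {
      const socket = net.connect({ host, port });
      sockets.push(socket);

      socket.once('connect', () => {
        if (done) return;
        finish(socket);
        resolve(socket);
      });

      socket.once('error', () => {
        pending -= 1;
        // Individual failures are ignored until every attempt has failed.
        if (!done && pending === 0) {
          finish(null);
          reject(new Error('All connection attempts failed'));
        }
      });
    }
  });
}

// Example (placeholders from the log in this issue):
// connectToFirstResponder([
//   { host: 'DR_IP', port: 1433 },
//   { host: 'PRIMARY_IP', port: 1433 },
// ]);
```

The sketch also shows why a device that answers the SYN on the inactive subnet breaks the scheme: the race is decided by the TCP handshake alone, so whichever address completes it first wins, even if nothing behind it can answer the PreLogin packet.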