scylladb / gocql

Package gocql implements a fast and robust ScyllaDB client for the Go programming language.

Home Page: https://docs.scylladb.com/stable/using-scylla/drivers/cql-drivers/scylla-go-driver.html

A warning message when a connection to a shard-aware port times out is bogus

vladzcloudius opened this issue

What version of Scylla or Cassandra are you using?

2022.2.6

What version of Gocql are you using?

HEAD: e38b2bc

What version of Go are you using?

Irrelevant

What did you do?

Tried to create a new connection to the cluster using a shard-aware port.

What did you expect to see?

An error message that would not confuse me.

What did you see instead?

A very confusing message that sent me in a totally irrelevant direction and cost several people more than 3 working days before we were finally able to figure out what the problem was.


If you are having connectivity related issues please share the following additional information

Describe your Cassandra cluster

"Cassandra cluster"?! You really want to fix your GH templates ;)

please provide the following information

  • output of nodetool status

Can't do! Production system!
Single DC, 36 nodes, 3 racks.
Each rack has 12 nodes.

  • output of SELECT peer, rpc_address FROM system.peers
  • rebuild your application with the gocql_debug tag and post the output

Both the above are unfeasible.

Description
The error message in question is this:

xxxx/xx/xx xx:xx:xx scylla: a.b.c.d:19042 connection to shard-aware address a.b.c.d:19042 resulted in wrong shard being assigned; please check that you are not behind a NAT or AddressTranslater which changes source ports; falling back to non-shard-aware port for 5m0s

But the thing is that a NAT or an AddressTranslator is not the only possibility here.
Given gocql#1701, it's very easy to hit the ConnectTimeout, which defaults to 600ms (!!!).

As a result, if one of the shards (shard A) is overloaded and a TCP connection to 19042 times out because of that, the driver falls back to a "storm" connection policy and tries to connect to the non-shard-aware port (9042): https://github.com/scylladb/gocql/blob/master/scylla.go#L422

And then it gets interesting (which also took us some time to figure out after we realized that NAT has nothing to do with this): because the driver creates most TCP connections asynchronously: https://github.com/scylladb/gocql/blob/master/connectionpool.go#L484
the following race may happen:

  1. A connection to Shard A using 19042 is sent.
  2. A connection to Shard B using 19042 is sent.
  3. (2) times out and sends a connection to 9042.
  4. (3) lands on shard A and succeeds.
  5. (1) completes but hits (https://github.com/scylladb/gocql/blob/master/scylla.go#L408) and prints the aforementioned message blaming NAT.
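The interleaving above can be replayed deterministically with two goroutines. This is only a sketch of the ordering, not the driver's logic: `pool`, `claim` and `replayRace` are hypothetical names, channels force the order of the five steps, and the sketch assumes one connection slot per shard. How the real check in scylla.go decides to print the warning is not modeled here; the sketch only shows that shard A is already occupied by the 9042 connection when attempt (1) finally completes.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical per-shard slot table (assumption: one slot per shard).
type pool struct {
	mu    sync.Mutex
	slots map[string]int // shard name -> port of the connection holding the slot
}

// claim stores the connection unless the shard already has one.
// It reports whether the connection was stored.
func (p *pool) claim(shard string, port int) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, taken := p.slots[shard]; taken {
		return false
	}
	p.slots[shard] = port
	return true
}

// replayRace replays steps 1-5 and returns the event log plus whether the
// late shard-aware connection to shard A could still claim its slot.
func replayRace(p *pool) ([]string, bool) {
	var (
		mu           sync.Mutex
		events       []string
		wg           sync.WaitGroup
		lateAccepted bool
	)
	record := func(s string) {
		mu.Lock()
		events = append(events, s)
		mu.Unlock()
	}
	started1 := make(chan struct{})     // closed once attempt (1) has been sent
	fallbackDone := make(chan struct{}) // closed once the 9042 connection is in place

	wg.Add(2)
	go func() { // attempt (1): shard A via 19042, completes late
		defer wg.Done()
		record("1: connection to shard A via 19042 sent")
		close(started1)
		<-fallbackDone
		lateAccepted = p.claim("A", 19042)
		record("5: connection (1) completes")
	}()
	go func() { // attempt (2): shard B via 19042, times out and falls back
		defer wg.Done()
		<-started1
		record("2: connection to shard B via 19042 sent")
		record("3: (2) timed out, connecting via 9042")
		p.claim("A", 9042) // the server happens to place it on shard A
		record("4: 9042 connection landed on shard A")
		close(fallbackDone)
	}()
	wg.Wait()
	return events, lateAccepted
}

func main() {
	p := &pool{slots: map[string]int{}}
	events, lateAccepted := replayRace(p)
	for _, e := range events {
		fmt.Println(e)
	}
	fmt.Println("late shard-aware connection accepted:", lateAccepted)
}
```

In this replay no source port was rewritten at any point, yet the shard-aware connection to shard A arrives to find its shard already served through 9042.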

So, either fix the message or fix the race.

I'm going to file a separate GH issue about this fallback for a "storm" connection policy in general.